parsing 75,000+ lines of text?

meeble's icon

heya,

I have this part of my patch that I wrote to parse many lines of text, one line at a time to fill a matrix with that data. Everything works fine when I am parsing 10,000 or so lines of text. I tried it with 75,000+ lines of text, and it crashes Max. I'm sure my design is the problem. Any help would be greatly appreciated. Thanks.

leafcutter's icon

You could try a defer or deferlow object on the uzi or, if that does not work, split the text file into 10,000-line chunks.

seejayjames's icon

Not sure...I wouldn't use a line number and a bang from [uzi], just send the output line from [text] right into the [route] for processing. [uzi] is so fast that it can create weird bottlenecks (or something...) which sometimes needs a [defer] or [deferlow] placed in the path of its destruction. So, eliminate the double-output and maybe experiment with [defer] or [deferlow] after the right outlet of [uzi]. My guess is that this is where the problem is.

Put a [deferlow] after your bang to the [uzi] as well, so the "query" message has time to output. It might need a few more msec to determine the number of lines in a huuuuuge file. So, if you bang the [uzi] prematurely, it may start asking for lots of lines which don't exist, or something...might be a recipe for trouble.

Because your application isn't super time-sensitive (you don't need it to do this every second or anything), allowing extra time at these critical steps might solve the issue.

Unless [text] has a size limit....? If so, then there's the issue. But I imagine it can handle files based on your available RAM, no?

Roman Thilenius's icon

i bet the text object has a size limit

mattyo's icon

I'm betting with seejay on this one. I recently did a project with the entire Old Testament in a text object...

M

leafcutter's icon

you can happily store a couple of million lines in the text object.

meeble's icon

It seems loading data into the text object isn't the issue, as reading a 200,000-line text file is almost instantaneous. As leafcutter's patch demonstrates, filling a text object with 1,000,000 lines is rather fast. I have modified this part of my patch, replacing the uzi with a metro/counter. It no longer crashes, but it still slows down substantially at around line 10,000. I'm not sure why. It takes me 2 minutes and 19 seconds to parse a 49,382-line text file with this version of the patch. I'm also attaching that text file.

5036.dode4b.txt.zip
seejayjames's icon

Same here...slows down after 10000 or thereabouts and keeps getting slower gradually. Tried with [uzi] and had to force quit due to the long spinning rainbow wheel of death.

I wonder if those [sends] are adding to the overhead...can you patch directly instead somehow? Also maybe [route set] is faster than [zl filter set], because it doesn't have to process the whole line, but in testing it seemed about the same.

If you could remove all the "+" in a text editor first, which would be easy there, that would eliminate the [regexp] too. I'm sure that would speed things up a bit.

I know some objects (like [sprintf]) retain a memory of all the data that's passed through them, don't they? Do any of your patch objects do this? If so, that would be a place to alter things for sure...

meeble's icon

hi seejayjames,

Ok, this version connects directly to the jit.fill object, which seems to slow things down even more. I took your advice and preprocessed the text file to remove the + characters. From my testing, the regexp and zl filter objects don't affect the slowdown much. I also removed the send and receive objects, but that doesn't affect the slowdown at all.

Not sure if any of the modules are retaining memory of passed data - but I still don't know why that would make this process so slow.

5038.dode4c.txt.zip
ak's icon

Another possible reason: for each (line $1) message, [text] is probably seeking that line from the beginning of the file.
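
To put rough numbers on that guess, here is a minimal sketch of the arithmetic. It assumes, without confirmation, that [text] walks from the first line on every request. The function names are hypothetical and this is a cost model, not a measurement of [text] itself.

```javascript
// Hypothetical cost model, NOT a measurement of the [text] object.
// Assumption: fetching line k by seeking from the top touches k lines.
function stepsSeekingFromStart(lineCount) {
    var steps = 0;
    for (var k = 1; k <= lineCount; k++) {
        steps += k; // walk k lines to reach line k
    }
    return steps;
}

// A reader that only moves forward touches each line once.
function stepsSequential(lineCount) {
    return lineCount;
}

var n = 49382; // size of the test file mentioned earlier in the thread
var ratio = stepsSeekingFromStart(n) / stepsSequential(n);
```

Under that assumption the total work grows with the square of the line count, which would be consistent with a slowdown that keeps getting worse as the line number climbs.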

seejayjames's icon

Sorry to hear that the tweaks didn't work to speed things up...

Andrzej, I bet that's it! Any way to work around that?

Have you tried reading it into [coll] instead? Because you can use the "next" message in [coll]. Maybe that would save the time spent searching from the start of the file, if that's indeed the issue.

You'd need to add line numbers to the file first so it can be opened by [coll], but should be doable. Actually, you could use [text] to add the line numbers (prepend with your counter numbers) and send it into another [text], save the new version, then try [coll] for parsing. Might be worth a try...
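
If doing the numbering in a patch is awkward, the same step could be sketched in plain JavaScript. This assumes the usual [coll] text layout of one "index, data;" entry per line, and the function name toCollText is made up for illustration. Lines that themselves contain commas or semicolons would need extra handling that is not shown here.

```javascript
// Sketch: prefix each line with an index in the "index, data;" form
// used by [coll] text files. Skipping blank lines is an assumption.
function toCollText(lines) {
    var out = [];
    var index = 1;
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].replace(/^\s+|\s+$/g, ""); // trim whitespace
        if (line === "") { continue; }
        out.push(index + ", " + line + ";");
        index++;
    }
    return out.join("\n");
}

var sample = ["frame 1", "", "0.1 0.2 0.3"];
var collText = toCollText(sample);
```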

meeble's icon

hey seejayjames,

coll won't work, because these text files are generated by another program - some lines that start with the word "frame" pass along other needed variables and values, it's not just repeating lines of numbers.

seejayjames's icon

yeah, I thought maybe that was the case...dangit. Any way to get the values by themselves somehow, so it's coll-friendly? Maybe that's too much effort to modify the other program. BTW what is the data, out of curiosity?

meeble's icon

the idea of pre-processing the text output of the other app is possible, but not ideal. These text files are frames of points to be sent to a laser projector over time, and I will need a way to keep frames separate in the Max app, hence the "#" comment lines. I will likely need to keep other information about each specific sequence of frames, hence the lines beginning with "frame".
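
A rough sketch of how that grouping might look in JavaScript follows. The file layout here is an assumption pieced together from the description above (a "#" line starts a new frame, a "frame" line carries settings, everything else is a point); the real files may differ, and parseFrames is a hypothetical name.

```javascript
// Sketch of a line-by-line parser under an ASSUMED file layout:
//   "# ..."             starts a new frame
//   "frame <k> <v> ..." stores per-frame settings
//   anything else       is one point, as whitespace-separated numbers
function parseFrames(lines) {
    var frames = [];
    var current = null;
    function startFrame() {
        current = { settings: [], points: [] };
        frames.push(current);
    }
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i].replace(/^\s+|\s+$/g, ""); // trim whitespace
        if (line === "") { continue; }
        if (line.charAt(0) === "#") { startFrame(); continue; }
        if (current === null) { startFrame(); } // data before any "#" line
        var parts = line.split(/\s+/);
        if (parts[0] === "frame") {
            current.settings.push(parts.slice(1)); // keep settings as strings
        } else {
            current.points.push(parts.map(Number)); // convert point to numbers
        }
    }
    return frames;
}

var demo = parseFrames(["# a", "frame speed 30", "0.1 0.2", "0.3 0.4", "# b", "1 2"]);
```

Each entry in the result keeps one frame's settings and points together, so a matrix could be filled one frame at a time.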

pid's icon

Use dict. Fast.

meeble's icon

hey thanks for the idea, pid. Have you seen the sample text file format I attached above? How would I go about loading such a text file into a dict object? And can dict objects easily contain 75,000+ or even 250,000+ objects/records in them?

The whole point is to take the native text file output from another app, parse it line by line, and then use the various lines of data and parameters to both fill matrices and set pertinent values within the Max patch.

Andrew Pask's icon

My suggestion for the best way to do this would be to load it into JS using the File() object.

Something like

function load() {
    // Accumulates the whole file in one string.
    var lines = "";
    var txt_file = new File("/Users/lalalala/Desktop/dode4c.txt");

    // Keep reading until the read position reaches the end of the file.
    while (txt_file.position != txt_file.eof) {
        lines += txt_file.readline();
    }
    txt_file.close();
    outlet(0, "done");
}

loads the file for me in under a second. Plus now that you're in JS you can easily format the data for your jitter matrix.

We could consider an optimisation feature request for the text object, but this would be low priority given the performance of other solutions.

Cheers

Andrew

meeble's icon

Thanks, Andrew. Although I'm sure you are correct - I don't know JS, so even the simplest of tasks presents another hurdle for me.

Andrew Pask's icon

Actually - this might be more a case of the scheduler getting flooded than anything else. I'm not sure that there is currently a way of doing this sort of procedural thing over that size of data in a patcher in any reasonably quick way.

Do you need this in realtime? Perhaps instead of using uzi based approaches to this you could use something based on a qmetro 1, and have a low priority "formatter" patcher which just chews away on the files in the background, turning them into matrix jxfs.
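
The "chew away in the background" idea can be sketched in plain JavaScript as a worker that handles a small batch of lines per tick. This is only an illustration: makeChunkedWorker and its parameters are hypothetical names, and in a patch something like the qmetro would drive tick() rather than the loop used here.

```javascript
// Sketch: process a fixed number of lines per tick instead of all at once.
// In a patch, tick() would be triggered by something like [qmetro 1];
// here it is called in a plain loop purely for illustration.
function makeChunkedWorker(lines, chunkSize, handleLine) {
    var next = 0; // index of the next unprocessed line
    return {
        // Handles up to chunkSize lines; returns true once all are done.
        tick: function () {
            var end = Math.min(next + chunkSize, lines.length);
            while (next < end) {
                handleLine(lines[next], next);
                next++;
            }
            return next >= lines.length;
        }
    };
}

var seen = [];
var worker = makeChunkedWorker(["a", "b", "c", "d", "e"], 2, function (line) {
    seen.push(line);
});
var ticks = 0;
var done = false;
while (!done) {
    done = worker.tick();
    ticks++;
}
```

A smaller chunk size leaves more time between batches for the rest of the patch, at the cost of a longer total run.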

Cheers

meeble's icon

Andrew,

No, this doesn't need to be a real time thing - I just want it to be as fast as possible, as I deal with a lot of these kinds of large data files. If you look further down the thread, I changed the patch to use a metro/counter object instead of an uzi, but I still experience major slowdowns - where processing a 200,000 line file takes so long it is useless.

perhaps I will do a search for a good Max JavaScript tutorial...

seejayjames's icon

You could probably try something similar to the version Andrew posted, just have the lines get spit out and do the parsing you need outside the js. Or if you can figure out how to do it inside the js that might be faster. It seems like the parsing etc. isn't the time issue, it's [text] accessing each line sequentially. Maybe the js eliminates this problem.

How about this:

function load() {
    var txt_file = new File("/Users/lalalala/Desktop/dode4c.txt");

    // Send each line out of the left outlet as soon as it is read.
    while (txt_file.position != txt_file.eof) {
        outlet(0, txt_file.readline()); //not sure if this works without assigning the line to a variable first?
    }

    txt_file.close();
    outlet(0, "done");
}

Outside, you'd parse the lines the way you're already doing, just add "done" to your possible [route] matches.

If there's an issue with flooding in the while loop (try it first!), you could experiment using low-priority with the Task object:
https://cycling74.com/docs/max5/vignettes/js/jstaskobject.html

Lots of good js tutorials in the Max docs. Definitely one of the friendlier coded languages IMO. For this purpose it might make a big difference.

meeble's icon

thanks, Andrew/seejayjames.

I figured out how to trigger the load function by simply sending a "load" message into the js object. :)

How difficult would it be to connect some kind of file-loading UI element to this JavaScript rather than hard-coding the file name?

UPDATE: the js object parses text files a LOT faster than the text object! I do get a spinning beach ball for 30 or so seconds while it is working - not a big deal.

Interesting to note: when the text object was doing its parsing, only one of my 8 CPU cores was being used at 100% according to top. The js object is using almost 100% of 4 cores.

seejayjames's icon

"How difficult would it be to connect some kind of file-loading UI element to this JavaScript rather than hard-coding the file name?"

Just use [dialog] then [prepend load] on dialog's output, then into the js, should do the trick.

Yep, any function inside your js is called like that: "load $1" or for example "multiply $1 $2" if you had a function that takes 2 arguments and multiplies them. Very easy to interface to the js functions from the patch.