Search inside text without using Perl?

tep's icon

Hi,

I'm trying to build a patch that fetches an HTML page and matches some parts of its code.

I feel rather uncomfortable with Perl-style regular expressions, so I wonder if there is a way to perform some "search in text" in MaxMSP without using them at all. Unless that gets much more complicated ...?

Luke Hall's icon

Regular expressions are probably the most efficient, if not the easiest, way to accomplish this. I have a tutorial built as a max patch that you might find helpful if you click the tools link under my name. If you do end up getting stuck then post a little bit more about the text you are searching and the string you are looking for and I will happily take a look for you.

tep's icon

Thanks a lot Luke,
your tutorial helped me, at least to get something out of my html code.

So, as a quick and dirty first step, here is my example. I'll use some seismic data from this website: http://www.iris.edu/seismon/

The lines of code which are interesting look like this :

tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";
tts[40]="MAG 4.9 03-NOV-2011 06:37:49 SOUTH SANDWICH ISLANDS REGION";
tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";
tts[42]="MAG 4.8 03-NOV-2011 01:51:41 NEAR COAST OF CENTRAL CHILE";
tts[43]="MAG 4.7 03-NOV-2011 00:19:00 SOLOMON ISLANDS";

So I used this (OK, I still don't really know what I'm doing):

regexp (MAG)\s(\d.\d)\s(\w+)-(\w+)-(\w+)\s(\w+):(\w+):(\w+)\s @substitute "%1 %2 %3-%4-%5 %6:%7:%8 %9"

- i used the first backreference (MAG) as a "detector" so it outputs only the needed lines
- then date & time

But then I'm stuck with the rest of the line. How can I keep the end of the line as a whole, given that it may or may not contain a comma and has a variable number of words? (And, if possible, remove unwanted characters like or " ...)


Also, to make it a little cleaner, I would have liked to capture the beginning of the line, tts[number],
... and erase it using %0 ... But that is already too hard for me :(

Thanks again. A little better but still very obscure !

Luke Hall's icon

Here's how I would grab all the data following the string "MAG" and store it in a [coll] for splitting and using later. If you look in the regexp tutorial file where it shows you how to parse HTML tags, you can see a quick method for matching everything but a certain character. This method is speedy and lets you have variable-length strings, as long as the text you are parsing can be relied upon to be in a certain format. Since this appears to be the case in your examples, you don't need to refer individually to every single group of letters or numbers. You can select everything you need in one go and split it up in Max, outside of the [jit.str.regexp] object.

The expression I've used is this below and I'll explain it step by step for clarity:

MAG\s([^"]*)"

MAG - the literal string which indicates the start of the data you want
\s - any whitespace character
( - start a backreference
[ - start a character class
^ - negate the character class (anything but...)
" - a literal double quote
] - close the character class
* - repetition metacharacter (zero or more of the preceding character class)
) - close the backreference
" - a literal double quote


This means the regexp will find and store any characters that are not a double quote mark between the string "MAG" and the first double quote mark it finds. This will be sent from the second outlet and you can slice and dice it in max as you see fit.
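For readers who want to try the expression outside Max, here is a minimal sketch in Python using the sample lines from earlier in the thread. Python's `re` module stands in for the Max regexp object here, and the splitting step afterwards is only one assumed way to "slice and dice" the capture; it is not taken from the patch.

```python
import re

# Sample lines copied from the post earlier in this thread.
page = '''tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";
tts[40]="MAG 4.9 03-NOV-2011 06:37:49 SOUTH SANDWICH ISLANDS REGION";
tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";
'''

# Same expression as above: capture everything after "MAG " up to,
# but not including, the next double quote.
pattern = re.compile(r'MAG\s([^"]*)"')

# With a single group, findall returns the captured text of each match.
captures = pattern.findall(page)

# Split each capture afterwards, outside the regexp. The first three
# whitespace-separated fields have a fixed layout; whatever remains
# (commas included) is the region name.
events = []
for text in captures:
    magnitude, date, time, region = text.split(None, 3)
    events.append((float(magnitude), date, time, region))

for event in events:
    print(event)
```

Because the character class only excludes the double quote, a region such as "FLORES REGION, INDONESIA" comes through intact, comma and all.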

tep's icon

Thanks for the explanations :)

tep's icon

OK, now I'm trying to go a bit further ... But despite the plethora of messages here about regular expressions, and the details you gave me, I have spent hours without success
...

This previous example was a continuous content following one string.

the syntax i'm focusing on now is like this :

... etc

So i'm trying to get the data immediately following many strings (here, coords & title only)

But how do I express that commas act as separators inside the first backreference?

i tried many things, without getting any data out.
I tried
jit.str.regexp @re coords="([^,]*)"
because it seems logical, but it doesn't work. So I thought the comma might need to be escaped and put a \ before it. That doesn't work, so I put two \.
That doesn't work either.

Embarrassingly, I also don't know how to say that the two strings need to follow each other.

When i do
jit.str.regexp @re coords="([^"]*)"\s?title="MAG\s([^"]*)"

I don't know if I have to put a ? or a + in between.

And finally (!!), if I add @substitute %1 %2, just to get the values, Max crashes.

i'm sorry to be such a hassle, but my first contact with regexp is freaky.

Luke Hall's icon

Try this: jit.str.regexp @re coords="([^,]*),([^,]*),([^"]*)". It will report the three groups of characters separated by commas inside quote marks. You can always combine it with the previous expression to get the magnitude and location data.
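A minimal Python sketch of how that expression divides a comma-separated attribute into three groups. Note that the input line below is HYPOTHETICAL: the actual markup was not preserved in this thread, so the attribute values are made up purely for illustration.

```python
import re

# HYPOTHETICAL input line: the real page markup is not shown in the
# thread, so these attribute values are invented for illustration only.
line = 'coords="101,202,5" title="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE"'

# Same expression as above: two groups that stop at a comma,
# then a third group that stops at the closing double quote.
coords_re = re.compile(r'coords="([^,]*),([^,]*),([^"]*)"')

match = coords_re.search(line)
first, second, third = match.groups()
print(first, second, third)
```

Each `[^,]*` stops at the next comma, and the literal comma between the groups consumes the separator, so no separator ends up inside a backreference.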

tep's icon

Thanks, I tried it earlier ... Max doesn't want ([^\,]*), as it automatically removes the backslash, but it seems to work.

Doing this, the coll gets a new index for each incoming piece of data, so I get 4 lines instead of one.

also, do i need to put ? or + between
coords="([^,]*),([^,]*),([^"]*)"
and
title="MAG\s([^"]*)"

i don't get the difference
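On the difference between the two quantifiers: `?` means "zero or one" of the preceding item and `+` means "one or more" (`*` is "zero or more"). A small Python sketch of what that implies for the gap between the two attributes; the test strings are hypothetical, and `matches` is a helper name introduced only for this illustration.

```python
import re

def matches(quantifier, text):
    """Return True if coords="..." is followed by title="..." with the
    whitespace between them governed by the given quantifier."""
    pattern = r'coords="[^"]*"\s' + quantifier + r'title="[^"]*"'
    return re.search(pattern, text) is not None

# Hypothetical inputs with zero, one and three spaces between attributes.
no_space = 'coords="1,2,3"title="MAG 4.6 X"'
one_space = 'coords="1,2,3" title="MAG 4.6 X"'
three_spaces = 'coords="1,2,3"   title="MAG 4.6 X"'

for name, text in [("none", no_space), ("one", one_space), ("three", three_spaces)]:
    print(name, [matches(q, text) for q in ("?", "+", "*")])
```

So `\s?` tolerates a missing space but not several, `\s+` requires at least one, and `\s*` accepts any amount including none.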


Finally, why does Max crash if I add a @substitute? Bad syntax or a bug?

Luke Hall's icon

You don't really need to be using @substitute, and in that patch you're using the backreferences outlet rather than the substitutions one. Here's how I'd group the text into one [coll] entry per line:
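For readers following along outside Max, the idea of one stored entry per source line can be sketched in Python as below. This is not a transcription of the patch; it assumes the tts[number] lines quoted earlier in the thread as input and uses a plain dict in place of [coll], keyed by the number in tts[...].

```python
import re

# Sample lines copied from earlier in this thread.
page = '''tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";
tts[40]="MAG 4.9 03-NOV-2011 06:37:49 SOUTH SANDWICH ISLANDS REGION";
tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";
'''

# Group 1: the number inside tts[...]; group 2: everything after "MAG "
# up to the closing double quote. The brackets are escaped because an
# unescaped [ would start a character class.
entry_re = re.compile(r'tts\[(\d+)\]="MAG\s([^"]*)"')

# One dict entry per matched line, keyed by the tts index.
entries = {int(index): text for index, text in entry_re.findall(page)}

for index in sorted(entries):
    print(index, entries[index])
```

Capturing the index and the data in a single match keeps everything belonging to one line together, instead of producing a new entry for every backreference.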

blaketurner's icon

Hey Luke and tep,
I'm working on a similar project that scans websites, pulls the text, parses it and displays the headlines. I can parse the site's HTML, but I end up with a bunch of extra text and not just headlines. Any help is greatly appreciated:

3182.WorkingNewsService2.maxpat
Max Patch