I'm trying to build a patch that fetches an HTML page and matches some parts of the code inside it.

I feel rather uncomfortable with Perl, so I wonder whether there is a way to perform some "search in text" in MaxMSP without using Perl at all. Unless that gets much more complicated ...?



Regular expressions are probably the most efficient, if not the easiest, way to accomplish this. I have a tutorial built as a max patch that you might find helpful if you click the tools link under my name. If you do end up getting stuck then post a little bit more about the text you are searching and the string you are looking for and I will happily take a look for you.

Thanks a lot Luke,
your tutorial helped me, at least to get something out of my HTML code.

So, as a first and dirty step, here is my example. I'll use some seismic data from this website: http://www.iris.edu/seismon/

The lines of code which are interesting look like this:

tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";
tts[40]="MAG 4.9 03-NOV-2011 06:37:49 SOUTH SANDWICH ISLANDS REGION";
tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";
tts[42]="MAG 4.8 03-NOV-2011 01:51:41 NEAR COAST OF CENTRAL CHILE";
tts[43]="MAG 4.7 03-NOV-2011 00:19:00 SOLOMON ISLANDS";

So I used this (OK, I still don't really know what I'm doing):

regexp (MAG)\s(\d.\d)\s(\w+)-(\w+)-(\w+)\s(\w+):(\w+):(\w+)\s @substitute "%1 %2 %3-%4-%5 %6:%7:%8 %9"

- I used the first backreference (MAG) as a "detector" so it outputs only the needed lines
- then date & time

But then I'm stuck with the content that follows. How can I keep the end of the line as a whole, considering it may or may not be separated by a comma and contains a variable number of words? (And eventually remove unwanted characters like or " ...)

Also, to make it a little cleaner, I would have liked to capture the beginning of the line, tts[number], and erase it using %0 ... But that gets too hard for me :(

Thanks again. A little better but still very obscure !
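
To see where the pattern above stops, here is a minimal sketch that runs it through Python's `re` module instead of Max's [regexp] object (an assumption: the two engines behave alike for this simple pattern). The only change is an escaped dot in the magnitude, so that `\.` matches a literal period. The pattern defines eight groups, so a ninth reference such as %9 has nothing to point at, and the region text is left outside the match.

```python
import re

# The pattern from the post, with the dot in the magnitude escaped.
PATTERN = re.compile(r'(MAG)\s(\d\.\d)\s(\w+)-(\w+)-(\w+)\s(\w+):(\w+):(\w+)\s')

line = 'tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";'
m = PATTERN.search(line)

# Eight groups: MAG, magnitude, three date parts, three time parts.
print(m.groups())

# Everything after the match: the region text plus the closing quote and semicolon.
print(line[m.end():])
```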


Here's how I would grab all the data following the string "MAG" and store it in a [coll] for splitting and using later. If you look in the regexp tutorial file where it shows you how to parse HTML tags, you can see a quick method for matching everything except a certain character. This method is speedy and lets you have variable-length strings, as long as the text you are parsing can be relied upon to be in a certain format. Seeing as this appears to be the case in your examples, you don't need to individually refer to every single group of letters or numbers. You can select everything you need in one go and split it up in Max outside of the [jit.str.regexp] object.

The expression I've used is below, and I'll explain it step by step for clarity:

MAG\s([^"]*)"

MAG - the literal string which indicates the start of the data you want
\s - any whitespace character
( - start a backreference
[ - start a character class
^ - negate the character class (anything but...)
" - a literal double quote
] - close the character class
* - repetition metacharacter (zero or more of the preceding class)
) - close the backreference
" - a literal double quote

This means the regexp will find and store any characters that are not a double quote mark between the string "MAG" and the first double quote mark it finds. This will be sent from the second outlet and you can slice and dice it in max as you see fit.
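
The same expression can be tried outside Max. The sketch below is a rough Python analogue, not the patch itself: `re.findall` plays the role of the backreference outlet, a plain dict stands in for the [coll], and the names `MAG_RE` and `store` are made up for illustration. It assumes the captured text always starts with magnitude, date and time separated by single spaces, as in the sample lines.

```python
import re

# Literal MAG, one whitespace character, then capture everything up to the next double quote.
MAG_RE = re.compile(r'MAG\s([^"]*)"')

page = '''tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";
tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";
tts[43]="MAG 4.7 03-NOV-2011 00:19:00 SOLOMON ISLANDS";'''

# A dict keyed by a running index stands in for the [coll].
store = {}
for i, captured in enumerate(MAG_RE.findall(page)):
    # Split off magnitude, date and time; whatever remains is the region text.
    mag, date, time, region = captured.split(' ', 3)
    store[i] = (float(mag), date, time, region)

for key, value in store.items():
    print(key, value)
```

Because the region is simply "whatever is left", it does not matter whether it contains a comma or how many words it has.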


OK, now I'm trying to go a bit further ... But despite the plethora of messages here about regular expressions, and the details you gave me, I've spent hours without success ...

The previous example was continuous content following one string.

The syntax I'm focusing on now is like this :

So I'm trying to get the data immediately following several strings (here, coords & title only).

But how do I express that there are commas as separators inside the first backref?

I tried many things without getting any data out. I tried
jit.str.regexp @re coords="([^,]*)"
because it seems logical, but it doesn't work. So I thought maybe the comma needs to be literal, so I put a \ before it. It doesn't work, so I put two \. It doesn't work either.

Stupidly, I also don't know how to say that the two strings need to follow each other. When I do
jit.str.regexp @re coords="([^"]*)"\s?title="MAG\s([^"]*)"
I don't know if I have to put a ? or a + in between.

And finally (!!) if I add an @substitute %1 %2, just to get the values, Max crashes.

I'm sorry to be such a hassle, but my first contact with regexp is freaky.


Try this: jit.str.regexp @re coords="([^,]*),([^,]*),([^"]*)". It will report the three groups of characters separated by commas inside quote marks. You can always combine it with the previous expression to get the magnitude and location data.
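
As a quick check outside Max, the sketch below runs both the single-group attempt and the three-group expression through Python's `re` module. The input string is hypothetical, since the real page markup isn't shown in the thread; only the `coords="a,b,c"` shape is taken from the discussion. The single-group version finds nothing because `[^,]*` stops at the first comma and the pattern then demands a closing quote right there.

```python
import re

# Hypothetical input; only the coords="a,b,c" shape comes from the thread.
snippet = 'coords="312,148,5"'

# Single group: fails, because a comma (not a quote) follows the first value.
single = re.search(r'coords="([^,]*)"', snippet)
print(single)

# Three groups: a literal comma between each pair of values, a quote after the last one.
triple = re.search(r'coords="([^,]*),([^,]*),([^"]*)"', snippet)
print(triple.groups())
```

The comma is not a special character in a regular expression, so it needs no backslash.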


Thanks, I tried it earlier ... Max doesn't want ([^\,]*), as it automatically removes the backslash - but it seems to work.

Doing this, the coll gets a new index for each incoming piece of data, so I get 4 lines instead of one.

Also, do I need to put ? or + between
coords="([^,]*),([^,]*),([^"]*)"
and
title="MAG\s([^"]*)"

Finally, why does Max crash if I add an @substitute? Bad syntax or bug?


You don't really need to be using @substitute, and in that patch you're using the backreferences outlet rather than the substitutions one. Here's how I'd group the text into one [coll] entry per line:
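
The patch isn't reproduced here, so the following is only a rough Python analogue of the idea, not Luke's patch: match coords and title with one combined expression and join the four backreferences into a single entry per match. The markup in `page`, the dict `entries` and the variable names are all made up for illustration. On the ? or + question: `\s?` allows zero or one whitespace character between the two attributes, `\s+` requires one or more, and `\s*` allows any number including none.

```python
import re

ENTRY_RE = re.compile(
    r'coords="([^,]*),([^,]*),([^"]*)"'  # three comma-separated values
    r'\s+'                                # one or more whitespace characters in between
    r'title="MAG\s([^"]*)"'              # everything after MAG up to the closing quote
)

# Hypothetical markup; only the coords="..." title="MAG ..." shape comes from the thread.
page = '''<area coords="312,148,5" title="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE">
<area coords="640,201,5" title="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA">'''

# One dict entry per matched line stands in for one [coll] entry per line.
entries = {}
for index, m in enumerate(ENTRY_RE.finditer(page)):
    first, second, third, info = m.groups()
    entries[index] = f"{first} {second} {third} {info}"

for key, value in entries.items():
    print(key, value)
```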

Hey Luke and tep,
I'm working on a similar project that scans websites, pulls the text, parses it and displays the headlines. I can parse the site's HTML, but I end up with a bunch of extra text and not just headlines. Any help is greatly appreciated:


search inside text without using PERL ?