search inside text without using PERL ?

    Oct 28 2011 | 4:21 pm
    i'm attempting to achieve a patch where i get - and match - some parts of code inside an html page.
    I feel rather uncomfortable with PERL, so i wonder if there would be a way to perform some "search in text" in MaxMSP without the use of PERL at all. Unless it gets much more complicated ...?

    • Oct 29 2011 | 2:34 am
      Regular expressions are probably the most efficient, if not the easiest, way to accomplish this. I have a tutorial built as a max patch that you might find helpful if you click the tools link under my name. If you do end up getting stuck then post a little bit more about the text you are searching and the string you are looking for and I will happily take a look for you.
    • Nov 03 2011 | 10:55 am
      Thanks a lot Luke, your tutorial helped me, at least to get something out of my html code.
      So, as a first and dirty step, here is my example. I'll use some seismic data from this website :
      The lines of code which are interesting look like this :
      tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE";
      tts[40]="MAG 4.9 03-NOV-2011 06:37:49 SOUTH SANDWICH ISLANDS REGION";
      tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";
      tts[42]="MAG 4.8 03-NOV-2011 01:51:41 NEAR COAST OF CENTRAL CHILE";
      tts[43]="MAG 4.7 03-NOV-2011 00:19:00 SOLOMON ISLANDS";
      So i used this : (ok i still don't really know what i'm doing)
      regexp (MAG)\s(\d.\d)\s(\w+)-(\w+)-(\w+)\s(\w+):(\w+):(\w+)\s @substitute "%1 %2 %3-%4-%5 %6:%7:%8 %9"
      - i used the first backreference (MAG) as a "detector" so it outputs only the needed lines - then date & time
      But then i'm stuck with the following content. How can i keep the end of the line as a whole, considering it can be separated by a comma or not, and contain a variable number of words ? (and eventually, remove unwanted characters like or " ...)
      Also, in order to make it a little cleaner, i would have liked to be able to get the beginning of the line : tts[number] ... and erase it using %0 ... But yet it gets too hard for me :(
      Thanks again. A little better but still very obscure !
    • Nov 04 2011 | 12:25 am
      Here's how I would grab all the data following the string "MAG" and store it in a [coll] for splitting and using later. If you look in the regexp tutorial file where it shows you how to parse HTML tags, you can see a quick method for matching everything but a certain character. This method is speedy and will let you have variable-length strings, as long as the text you are parsing can be relied upon to be in a certain format. Seeing as this appears to be the case in your examples, you don't need to refer individually to every single group of letters or numbers. You can select everything you need in one go and split it up in Max outside of the [jit.str.regexp] object.
      The expression I've used is below, and I'll explain it step by step for clarity:
      MAG\s([^"]*)"
      MAG - the literal string which indicates the start of the data you want
      \s - any whitespace character
      ( - start a backreference
      [ - start a character class
      ^ - negate the character class (anything but...)
      " - a literal double quote
      ] - close the character class
      * - repetition metacharacter
      ) - close the backreference
      " - a literal double quote
      This means the regexp will find and store any characters that are not a double quote mark between the string "MAG" and the first double quote mark it finds. This will be sent from the second outlet and you can slice and dice it in max as you see fit.
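      For anyone who wants to try the same idea outside Max, here is a minimal sketch in Python (not a Max patch, and not the patch posted here). It assumes the expression is the one assembled from the step-by-step breakdown above, uses two of the sample lines quoted earlier in the thread, and does the "slice and dice" step with an ordinary string split. The variable names are only illustrative.

```python
import re

# Sample text copied from the lines quoted earlier in the thread.
page = (
    'tts[39]="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE"; '
    'tts[41]="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA";'
)

# Literal MAG, one whitespace character, then capture every character that
# is not a double quote, stopping at the closing double quote.
pattern = re.compile(r'MAG\s([^"]*)"')

# With a single group, findall returns the captured strings directly.
events = pattern.findall(page)

# Split each captured string outside the regex: the first three
# space-separated fields are magnitude, date and time; the rest is the place.
records = []
for event in events:
    magnitude, date, time, place = event.split(" ", 3)
    records.append((float(magnitude), date, time, place))

for record in records:
    print(record)
```

      Because the place name is whatever is left after the third split, it does not matter whether it contains a comma or how many words it has.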
    • Nov 04 2011 | 3:39 pm
      Thanks for the explanations :)
    • Nov 05 2011 | 8:01 pm
      ok, now i'm trying to go a bit further ... But despite the plethora of messages here that talk about regular expressions, and the details you gave me, i've spent hours without success ...
      This previous example was a continuous content following one string.
      the syntax i'm focusing on now is like this :
      ... etc
      So i'm trying to get the data immediately following many strings (here, coords & title only)
      But how do i express that there are commas as separators inside the first backreference ?
      i tried many things, without getting any data out. I tried jit.str.regexp @re coords="([^,]*)" because it seems logical, but it doesn't work. So i thought maybe the comma needs to be literal, so i put a \ before it. It doesn't work, so i put two \. It doesn't work either.
      Stupidly, i don't know, also, how to say that the two strings need to follow each other
      When i do jit.str.regexp @re coords="([^"]*)"\s?title="MAG\s([^"]*)"
      i don't know whether i have to put a ? or a + in between.
      and finally (!!) if i add an @substitute %1 %2, in order just to get the values, Max crashes.
      i'm sorry to be such a hassle, but my first contact with regexp is freaky.
    • Nov 06 2011 | 12:41 am
      Try this: jit.str.regexp @re coords="([^,]*),([^,]*),([^"]*)". It will report the three groups of characters separated by commas inside quote marks. You can always combine it with the previous expression to get the magnitude and location data.
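      A quick way to check this expression outside Max is a few lines of Python. This is only a sketch: the sample page markup did not survive in the thread, so the input string below is invented for illustration, and the names first, second and third are placeholders. The comma escaping discussed in this thread concerns typing the expression into a Max object box; in a Python raw string the commas are written as-is.

```python
import re

# Hypothetical input: the original sample was not preserved in the thread,
# so this attribute string is made up purely for illustration.
tag = 'coords="120,45,6" title="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE"'

# Three groups inside the quotes: the first two stop at a comma,
# the last one stops at the closing double quote.
coords_re = re.compile(r'coords="([^,]*),([^,]*),([^"]*)"')

match = coords_re.search(tag)
first, second, third = match.groups()

print(first, second, third)
```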
    • Nov 07 2011 | 1:30 pm
      thanks, i tried it earlier ... Max doesn't want ([^,]*), as it automatically removes the backslash - but it seems to work.
      Doing this, the coll gets a new index for each incoming piece of data, so that i get 4 lines instead of one.
      also, do i need to put ? or + between coords="([^,]*),([^,]*),([^"]*)" and title="MAG\s([^"]*)"
      i don't get the difference
      finally, why does max crash if i add a @substitute ? bad syntax or bug ?
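      On the ? versus + question, the difference is only in how many repetitions of the preceding item are allowed: ? means zero or one, + means one or more, and * means zero or more. The Python sketch below illustrates this with three invented strings (no whitespace, one space, three spaces between the two attributes). The strings and variable names are assumptions for the demonstration, not data from the real page.

```python
import re

# Invented test strings: only the whitespace between the attributes differs.
no_space = 'coords="1,2,3"title="MAG 4.6 X"'
one_space = 'coords="1,2,3" title="MAG 4.6 X"'
many_spaces = 'coords="1,2,3"   title="MAG 4.6 X"'
samples = (no_space, one_space, many_spaces)

optional = re.compile(r'coords="[^"]*"\s?title=')      # zero or one whitespace character
one_or_more = re.compile(r'coords="[^"]*"\s+title=')   # at least one whitespace character
zero_or_more = re.compile(r'coords="[^"]*"\s*title=')  # any amount, including none

results = {
    "?": [bool(optional.search(s)) for s in samples],
    "+": [bool(one_or_more.search(s)) for s in samples],
    "*": [bool(zero_or_more.search(s)) for s in samples],
}

for quantifier, matched in results.items():
    print(quantifier, matched)
```

      So \s? fails as soon as there is more than one whitespace character between the two attributes, while \s+ fails when there is none.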
    • Nov 07 2011 | 11:04 pm
      You don't really need to be using @substitute, and in that patch you're using the backreferences outlet rather than the substitutions one. Here's how I'd group the text into one [coll] entry per line:
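      As a rough illustration of the "one entry per line" idea, here is a Python sketch that combines the two expressions from this thread and stores all groups of each match under a single index, much like one [coll] entry. It is not the patch referred to above. The markup in html is invented because the real page source was not preserved here, and the dictionary named entries merely stands in for the [coll].

```python
import re

# Hypothetical input: the tag name and attribute layout are assumptions.
html = (
    '<area coords="120,45,6" title="MAG 4.6 03-NOV-2011 07:55:00 NEAR COAST OF CENTRAL CHILE">\n'
    '<area coords="300,210,6" title="MAG 4.4 03-NOV-2011 05:42:22 FLORES REGION, INDONESIA">\n'
)

# The coords expression and the MAG expression joined by whitespace.
combined = re.compile(r'coords="([^,]*),([^,]*),([^"]*)"\s+title="MAG\s([^"]*)"')

# One index per match, holding all four captured groups together.
entries = {}
for index, match in enumerate(combined.finditer(html), start=1):
    entries[index] = match.groups()

for index, groups in entries.items():
    print(index, groups)
```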
    • Jan 09 2012 | 9:48 pm
      Hey Luke and tep, I'm working on a similar project that scans websites and then pulls the text, parses it and displays the headlines. I can parse the site's html but I end up with a bunch of extra text and not just headlines. any help is greatly appreciated: