regexp & line feed ...

    Jan 19 2012 | 4:04 am
    I need to extract various strings and values from some HTML code. The elements I'm interested in are separated by line feeds.
    Here is the kind of part that interests me, as shown in jit.textfile: Aasiaat

    • Jan 19 2012 | 11:32 am
      [jit.str.regexp @re " ([^\s+
      \s+ \s+([^"]
      I tend to use \s+ to get past line breaks: sometimes they are line feeds, sometimes carriage returns, and they are often followed by tabs or spaces. \s+ matches one or more consecutive whitespace characters, so it covers all of these cases.
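To illustrate the \s+ point outside Max, here is a minimal sketch in Python's re module (not jit.str.regexp, although the \s+ token means the same thing in both). The <td> markup and the value -12 are made up for the example; only the name Aasiaat comes from the thread.

```python
import re

# Hypothetical fragments: same content, different line endings and indentation.
samples = [
    "<td>Aasiaat</td>\n<td>-12</td>",        # line feed only
    "<td>Aasiaat</td>\r\n\t<td>-12</td>",    # carriage return + line feed + tab
    "<td>Aasiaat</td>\r    <td>-12</td>",    # carriage return + spaces
]

# \s+ consumes one or more whitespace characters of any kind,
# so a single pattern handles all three variants.
pattern = re.compile(r"(\w+)</td>\s+<td>(-?\d+)")

results = [pattern.search(s).groups() for s in samples]
for name, value in results:
    print(name, value)
```

All three fragments give the same pair of captures, which is the reason for preferring \s+ over a literal \n or \r.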
      I'd also recommend ([^>]+) for finding text between HTML tags, as it involves less backtracking to find the actual matches. It matches anything but a closing angle bracket (>), and as long as you don't have any literal ones in your text (they should be encoded in valid HTML) it will work a lot faster. I tend to avoid .* unless I can't find a way around it.
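The difference between .* and a negated character class can be seen in a small comparison. This is a sketch in Python's re module rather than in Max, and the two links are invented for the example.

```python
import re

# Hypothetical line containing two tags.
html = '<a href="page1.html">first</a> <a href="page2.html">second</a>'

# Greedy .* runs to the end of the line, then backtracks to the LAST ">".
greedy = re.findall(r"<a (.*)>", html)

# [^>]+ cannot cross a ">", so each match stops at the end of its own tag.
negated = re.findall(r"<a ([^>]+)>", html)

print(len(greedy), greedy)
print(len(negated), negated)
```

The greedy version returns a single oversized match that swallows both tags, while the negated class returns one clean capture per tag and never has to backtrack across the rest of the line.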
      The only other changes I've made are replacing your string of dots with \d{5}, since the only unique part of the URL appears to be a run of five digits, and escaping the literal quotes that appear around the URL.
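As a rough illustration of why \d{5} is safer than a string of dots, here is a sketch in Python. The URL is hypothetical, since the real one is not shown in the thread. In Python the pattern can sit inside single quotes, so the double quotes need no escaping; inside a Max object box they do.

```python
import re

# Hypothetical links: one with a five-digit id, one with only four digits.
good = '<a href="http://example.com/station/04220/">Aasiaat</a>'
bad = '<a href="http://example.com/station/0422/">Aasiaat</a>'

# \d{5} requires exactly five digits between the slashes,
# whereas five dots would accept any five characters.
pattern = re.compile(r'href="[^"]*/(\d{5})/"')

match = pattern.search(good)
station_id = match.group(1) if match else None
no_match = pattern.search(bad)

print(station_id, no_match)
```

The four-digit link is rejected, which a pattern made of dots would not do.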
      I hope this helps, let me know if you need a better explanation!
    • Jan 22 2012 | 5:44 pm
      Thanks again for this, Luke. Very useful. I don't think I could make it much cleaner; I removed all the *, and it works fine.
      I would also have liked to use @substitute to "print" all the data on a single line (if that's possible), although grouping the values in a coll is not much work. I still don't understand how @substitute is used, though...
      THANKS !
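On the single-line idea: in most regex engines, substitution means replacing every match of the pattern with a replacement string. The sketch below shows that in Python with re.sub. Whether @substitute in jit.str.regexp behaves the same way is an assumption here, and the sample values are invented.

```python
import re

# Hypothetical extracted data, still spread over several lines.
text = "Aasiaat\r\n\t-12\n   overcast\n"

# Replace every run of whitespace (line breaks, tabs, spaces) with one space,
# then trim the ends so everything sits on a single line.
single_line = re.sub(r"\s+", " ", text).strip()

print(single_line)
```

The pattern is the same \s+ used to skip line breaks earlier in the thread; only the replacement string is new.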