regexp & line feed …




-13 

#60804

\s+

I tend to use \s+ to get past line breaks: sometimes they are line feeds, sometimes carriage returns, and they are often followed by tabs or spaces. \s+ matches one or more consecutive whitespace characters, so it covers all of these cases.
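As a rough illustration (in Python rather than Max, and with made-up sample strings), here is how a single \s+ absorbs whichever mix of line endings and indentation sits between two values:

```python
import re

# Hypothetical samples: the same two fields separated by different whitespace.
samples = [
    "Aasiaat\n-13",       # line feed only
    "Aasiaat\r\n\t-13",   # carriage return + line feed, then a tab
    "Aasiaat\r   -13",    # carriage return, then spaces
]

# \S+ grabs a run of non-whitespace; \s+ skips one or more whitespace characters.
pattern = re.compile(r"(\S+)\s+(\S+)")

results = [pattern.search(text).groups() for text in samples]
for groups in results:
    print(groups)
```

All three samples produce the same pair of captures, which is the point: the pattern does not need to know which kind of line break it is crossing.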

I’d also recommend ([^>]+) for finding text between HTML tags, as it involves less backtracking to find actual matches. It matches anything but a closing angle bracket and, as long as you don’t have any literal ones in your text string (they should be encoded in valid HTML), it will run a lot faster. I tend to avoid .* unless I can’t find a way around it.
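A small sketch of the difference, again in Python and on an invented HTML fragment. It assumes the usual pairing: [^>] for the inside of a tag and [^<] for the text that follows it, which is my reading of the advice rather than the exact pattern from the patch:

```python
import re

# Hypothetical fragment with two cells on one line.
html = '<td class="city">Aasiaat</td><td class="temp">-13</td>'

# Greedy .* runs to the end of the string, then backtracks to the LAST </td>,
# so it swallows the markup between the two cells.
greedy = re.findall(r"<td[^>]*>(.*)</td>", html)

# A negated class cannot cross a tag boundary, so each capture stops at the next "<".
negated = re.findall(r"<td[^>]*>([^<]+)</td>", html)

print(greedy)
print(negated)
```

The greedy version returns one oversized match that still contains tags, while the negated-class version returns each cell's text separately and never has to back up across the whole string.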

The only other changes I’ve made are replacing your string of dots with \d{5}, since the only unique part of the URL appears to be a run of five digits, and escaping the literal quotes that appear around the URL.
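To show why \d{5} is tighter than five dots, here is a minimal Python sketch. The URL is a placeholder I made up, since the real one is not reproduced in this post:

```python
import re

# Hypothetical links; the real URL from the patch is not shown here.
good = '<a href="http://example.com/city/12345.html">Aasiaat</a>'
bad = '<a href="http://example.com/city/abcde.html">Aasiaat</a>'

# Five dots accept ANY five characters, so both links match.
loose = re.compile(r'href="http://example\.com/city/(.....)\.html"')

# \d{5} accepts exactly five digits, so only the numeric link matches.
# The double quotes are literal characters in the pattern. Python's single-quoted
# raw string lets them stand unescaped; in the Max object box they are escaped.
strict = re.compile(r'href="http://example\.com/city/(\d{5})\.html"')

print(loose.search(bad) is not None)
print(strict.search(bad) is not None)
print(strict.search(good).group(1))
```

The stricter pattern both documents what the variable part of the URL looks like and rejects anything that does not fit.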

I hope this helps, let me know if you need a better explanation!

#218856
Jan 19, 2012 at 4:04am


Here I need to extract different strings and values from some HTML code. The interesting elements are separated by line feeds.

Here is the kind of fragment that interests me, as shown in jit.textfile:

Aasiaat
Jan 19, 2012 at 11:32am

[jit.str.regexp @re "

([^< ]+)\s+ \s+([^< ]+)"]
Jan 22, 2012 at 5:44pm

Thanks again for this, Luke. Very useful.
I don’t think I could make it much cleaner; I removed all the *, and it works fine.

Now I would have liked to use @substitute to “print” all the data on a single line (if possible), but it’s no hard work to group them in a coll anyway.
I still don’t understand how @substitute is used, though…


THANKS !

#218857
