Regexp is (no) fun?

MaxMSP

Tj Shredder

jayrope

"?" is a kind of "or", or "else".

This regex
reg(ular expressions?|ex(p|es)?)
matches "regular expression", "regular expressions", "regex", "regexp" and "regexes".

Found this example on
http://www.regular-expressions.info/tutorial.html
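
For anyone who wants to check what that pattern accepts, here is a minimal sketch in Python. It uses Python's re module instead of the Max [regexp] object, so it only illustrates the pattern itself; the list of test words is made up.

```python
import re

# The pattern from the tutorial: "reg" followed by either
# "ular expression" (optional "s") or "ex" (optional "p" or "es").
PATTERN = re.compile(r"reg(ular expressions?|ex(p|es)?)")

# Hypothetical test words, chosen only for illustration.
candidates = ["regular expression", "regular expressions",
              "regex", "regexp", "regexes", "regular", "reg"]

# fullmatch() requires the whole word to fit the pattern.
accepted = [w for w in candidates if PATTERN.fullmatch(w)]
print(accepted)
```

The "?" after "s" and after "(p|es)" makes those parts optional, while the "|" chooses between alternatives.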

jrp
p.s.: couldn't make it on thursday, down with the flu, progging in bed.

nick rothwell | project cassiel

On 29 Nov 2008, at 12:40, jayrope wrote:

> "?" is a kind of "or", or "else".

"|" is an "or" (or "\|" depending which dialect of regexp you speak,
just to make it more fun).

"?" is a "maybe."

-- N.

Nick Rothwell / Cassiel.com Limited
www.cassiel.com
www.myspace.com/cassieldotcom
www.last.fm/music/cassiel
www.reverbnation.com/cassiel
www.linkedin.com/in/cassiel
www.loadbang.net

gusanomaxlist

Stefan Tiedje wrote:
> Another one I don't get: I found out how to change a "." into white
> space: [regexp (\.) @substitute " "] (should be in the examples...)
> But how to do the opposite? to change a space into dots or underscores
> etc..?
>

Hi.
I can only help you with that one:

regexp [\s] @substitute _

Ciao

Tj Shredder

Luke Hall

I agree, regexes can be quite confusing but once you understand what all the characters mean and how the engine progresses through a string you might find it becoming a little clearer.

Here's how to remove every instance of a single space.

[regexp " " @substitute %0]

%2 refers to the second pair of capturing parentheses () in the regex (%1 for the first, %3 for the third, etc.). The %0 above works in a similar way: instead of removing the space, it substitutes an empty string for it.
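
A rough Python equivalent, as a sketch only: Python has no %0, so an empty replacement string plays that role here. The input text is made up.

```python
import re

text = "one two three"   # made-up input

# Replace every single space with the empty string
# (the role %0 plays in the Max object).
no_spaces, count = re.subn(r" ", "", text)
print(no_spaces, count)
```

re.subn() also reports how many substitutions were made, which is handy when checking that every space was caught.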

Emmanuel Jourdan

On 29 nov. 08, at 12:05, Stefan Tiedje wrote:

> One which does, is one I don't understand. its a regex which is in
> the "Regexp is fun!" subpatcher to find if a string ends
> with .aif, .aiff, .wav, .wave. What does the "?i:" do exactly before
> "aiff?|wave?" It seems to deal with Upper/Lower case matching, but I
> have no idea how or why...

to summarize:
( ) creates a backreference (it will memorize that portion of
text); it's also a way of grouping elements so you can apply quantifiers
(?, *, +, {}). The problem with those is that they consume memory
because the engine has to memorize the whole thing

(?: ) is only for grouping, it doesn't create any backreference
(the second outlet of the regexp object won't output anything between
those parentheses)

(?i: ) defines a group within which the search is
not case sensitive.

so in the case of the one which is in the help file: .+\.(?i:aiff?|wave?)

. any character

+ quantifier which applies to the "thing" before it (can be a character, a
wildcard as it is here, or any group of other things) and which means
that the thing before should appear at least once

\. as the dot is a special character we need to escape it to represent the
literal character dot (and not the one which has the meaning of any
character), and for the escaping we need two backslashes (one for the
regexp, one for the max object)

(?i: ) the famous non-capturing parentheses which are case insensitive

aiff? means the letters aif followed optionally by another f (a
quantifier only applies to the last character, unless you explicitly
group multiple characters with parentheses)

| means or (it's either aiff? or wave?)

wave? means the letters wav followed optionally by an e
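
The same pattern can be tried outside Max. This is a sketch in Python (3.6 or later, which is when the scoped (?i: ) form became available there); the file names are invented, and a Python raw string needs only one backslash before the dot.

```python
import re

# One or more characters, a literal dot, then a case-insensitive extension.
AUDIO = re.compile(r".+\.(?i:aiff?|wave?)")

# Hypothetical file names for illustration.
names = ["loop.aif", "loop.AIFF", "loop.Wav", "loop.wave", "loop.mp3", ".wav"]

# fullmatch() checks that the whole name fits the pattern.
audio_files = [n for n in names if AUDIO.fullmatch(n)]
print(audio_files)
```

".wav" is rejected because .+ needs at least one character before the dot, and "loop.mp3" because neither alternative fits.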

> regexperts also have their own nomenclature. I was searching the http://www.regular-expressions.info/
> site to find out about "?i:" but its almost impossible to find it,
> as the syntax usually consists of single characters, and that is
> hard to match, even with regexp... But then they have their words
> for it as well, but if you don't know them there is no chance to
> find them...

The thing with regexp is that there's more than one way to express the
same thing, and depending on the specific task you want to achieve
you may come up with a different solution.

> Another one I don't get: I found out how to change a "." into white
> space: [regexp (\.) @substitute " "] (should be in the examples...)
> But how to do the opposite? to change a space into dots or
> underscores etc..?
> I couldn't get it. In general we need examples which deal with the
> special characters of Max...

for instance those two work fine:
regexp @re " " @substitute .
regexp @re \s @substitute .

the first one looks for spaces " " (the quotes are required because the
space is used in the Max object to delimit arguments/attributes). the
second solution uses the predefined character class for whitespace, \s
(note that there is also the opposite, \S, which represents anything
which is not whitespace).
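
To see the difference between a literal space, \s and \S, here is a small Python sketch (not the Max object; the input string is made up and deliberately contains a tab).

```python
import re

text = "kick drum\tloop 1"   # a space, a tab and another space

only_spaces = re.sub(r" ", ".", text)   # the tab is left alone
any_white = re.sub(r"\s", ".", text)    # the tab is replaced too
non_white = re.sub(r"\S", ".", text)    # the opposite class: everything but whitespace
print(repr(only_spaces), repr(any_white), repr(non_white))
```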

> In file names I usually need to find a match from the end, but
> regexpressions usually stop searching once a match occurs. What is a
> common way to let it find the last match?

I added one in the strippath help file too:

regexp .+/(.+)

this means:
.    any character
+    quantifier (the thing before appears at least once; because the
backtracking engine behind it makes quantifiers greedy by default, it
will first match the entire string and then step back until it finds
the next character, in this example the /)
/    literal character, we are definitely looking for a /
()    backreference: we memorize everything between those parentheses
.+    again anything which appears at least once

so because of the way the regexp engine works, we will only memorize
everything which is after the last /.
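
A quick way to watch the greedy behaviour is to compare it with the lazy form .+? in Python. This is only a sketch with an invented path; the lazy variant is added for contrast and is not part of the help file.

```python
import re

path = "sounds/drums/kick.aif"   # made-up path

# Greedy: .+ grabs as much as it can, so the "/" that matches is the last one.
greedy = re.match(r".+/(.+)", path).group(1)

# Lazy: .+? grabs as little as it can, so the "/" that matches is the first one.
lazy = re.match(r".+?/(.+)", path).group(1)
print(greedy, lazy)
```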

> And how to substitute a match with nothing? cut it out??? (needs an
> example I guess...)

%0 does that.

> In the lcd example I miss an explanation of the %2 notation, I
> couldn't find any in online regex tutorials. What does it do? It
> seems to replace but what is the syntax. If there are online
> resources, Id like to see links in the help file as well...
> The one I mentioned above is a good tutorial, but is neither
> searchable nor complete, leaves too many questions unanswered...

this example in the help file:
regexp (paint|frame)(rect|oval) @substitute paint%2

could also be written (if that makes more sense):
regexp .+(rect|oval) @substitute paint%1

Basically %n is a way to use what you memorized (the backreference
between parenthesis), in the substitution string. %0 is a special case
which means no characters, %1 is the first backreference, %2 the
second...
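
The two substitutions can be mimicked in Python, where a backreference in the replacement string is written \1, \2 instead of %1, %2. A sketch, with made-up lcd-style messages:

```python
import re

msg = "framerect 10 10 50 50"   # made-up message

# Two groups: keep the second one (rect or oval) and force "paint" in front.
two_groups = re.sub(r"(paint|frame)(rect|oval)", r"paint\2", msg)

# One group: .+ swallows whatever precedes rect or oval.
one_group = re.sub(r".+(rect|oval)", r"paint\1", "frameoval")
print(two_groups, one_group)
```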

I promise at some point I'll make some tuto for the website. In the
meantime, and in any case, the bible on the subject, which I can't
recommend reading enough, is Mastering Regular Expressions by Jeffrey
Friedl. It's definitely the thing that you wanna have for christmas ;-)

Cheers,
ej

Emmanuel Jourdan

Luke Hall

Personally, I think it's an extremely powerful and valuable tool but it is definitely hard to wrap your head around. I haven't found anywhere online that teaches me clearly and concisely how to use regexes as a beginner, although the site you mentioned in your original post can be helpful if you know what you are looking for (which in my case is rare!)

First of all, the % sign specifies a back-reference in a substitution string. The number after the % specifies which back-reference to use in the substitution, the first one is number 1.

?: makes the parentheses () non-capturing. Usually parentheses create a back-reference to whatever is contained within them (which is also sent as part of a list from [regexp]'s second outlet). But sometimes you want to enclose something in brackets to separate it from the rest of the regex without creating a back-reference to it. Here's an example:

Max Patch


Technically the non-capturing parentheses are an empty modifier span. (?i: ) is a modifier span that says everything within the brackets should be case insensitive. The opposite would be (?-i: ), which forces things to be case sensitive. You can use these without the colon to refer to everything to the right of the modifier instead of everything inside the parentheses: (?i) and (?-i). It's probably easier to explain with another example:

Max Patch

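
For readers without Max at hand, here is a sketch of the same ideas in Python (3.6 or later). The strings are invented. One difference to be aware of: Python only accepts the colon-less (?i) at the very start of a pattern, so the example places it there.

```python
import re

# Capturing vs. non-capturing: only the first pattern stores the extension.
capturing = re.match(r"(.+)\.(aif|wav)", "loop.wav").groups()
grouping_only = re.match(r"(.+)\.(?:aif|wav)", "loop.wav").groups()

words = ["MAXmsp", "maxmsp", "maxMSP"]   # made-up test strings

# (?i: ) switches case-insensitivity on for its own span only.
span_on = [w for w in words if re.fullmatch(r"(?i:max)msp", w)]

# (?i) applies to the whole pattern; (?-i: ) switches it off again for a span.
span_off = [w for w in words if re.fullmatch(r"(?i)max(?-i:msp)", w)]
print(capturing, grouping_only, span_on, span_off)
```

Both patterns accept any casing of "max" but insist on a lowercase "msp", which is why "maxMSP" is rejected each time.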

The patch you posted will work if you change your regex to [regexp (#) @substitute %0]. The "\b" at the end is a word-boundary anchor, and there is no word boundary after the "#" in your string, so the regex fails to match. If you want to match the character at the end and only at the end of the string use [regexp (#$) @substitute %0]. The $ is an anchor and means "match the preceding characters at the end of the string". The opposite is ^, which looks for the following characters at the beginning of the string.
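
A small Python sketch of the anchors, using made-up strings. It is not the patch from the thread, only an illustration of how $, ^ and \b behave.

```python
import re

text = "a#b#"   # made-up string with "#" in the middle and at the end

anywhere = re.sub(r"#", "", text)    # removes every "#"
at_end = re.sub(r"#$", "", text)     # removes only the final "#"
at_start = re.sub(r"^#", "", text)   # nothing to remove: no leading "#"

# "#\b" needs a word character right after the "#",
# so a "#" at the very end of the string cannot match.
boundary_end = re.search(r"#\b", "abc#")
boundary_mid = re.search(r"#\b", "#abc")
print(anywhere, at_end, at_start, boundary_end, boundary_mid)
```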

If any of my explanation isn't clear then let me know which specific bits are troubling you and I will try to come up with some examples where I can explain what the regex is doing character by character.

Emmanuel Jourdan

fairesigneaumachiniste

Here's one I'm struggling with:

Max Patch


Emmanuel Jourdan

Luke Hall

What you have here is two separate problems.

Your regex deals with the "symbol enclosed in quotes" but the unquoted one doesn't work. The first problem is not the disappearance of the trailing zeros but the commas: these are unescaped, so Max interprets them in the usual way by splitting the string into separate messages. Try sending the unquoted message directly to print and you'll see what I mean. In the patch included below is an abstraction for grouping together streams of data sent in this manner. That is problem number one solved.

Now if you look at what is coming out of the abstraction you will notice that it isn't the same as the quoted version on the left of the patch. This means we have to extend the regex to be able to understand both versions. One with commas and trailing zeros and one without. This isn't too tricky.

First we look for \d+ which is one or more numeric digits.
Then for \.? which is a literal period occurring once or not at all.
Finally we look for \d* another zero or more decimal digits.
All this is enclosed in brackets to form a back-reference (\d+\.?\d*).

To separate the numbers I get the regex to look for \D+, something other than a numeric character occurring at least once. This accommodates both our possibilities: a space, or a comma-space combo.

Now originally I just copied this string four times over in the regex because I was looking for 4 numbers. This is the top example in the patch below. However I've also put together a second version which can cope with any amount of numbers, this is the lower example, I'm not entirely sure that it is the "proper" way to achieve this but it seems to be working just fine at the moment. I hope it helps.
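
The number pattern can be tried in Python as well. This sketch is not the patch itself: the input strings are made up, and re.findall() stands in for the "any amount of numbers" version, which may work differently inside [regexp].

```python
import re

# Digits, an optional literal period, then optional further digits.
NUMBER = r"\d+\.?\d*"

quoted = "1.50, 2, 3.25, 4"   # made-up: comma-space between numbers
unquoted = "1.5 2 3.25 4"     # made-up: spaces only

# Fixed count: four captured numbers separated by non-digits (\D+).
four = re.match(rf"({NUMBER})\D+({NUMBER})\D+({NUMBER})\D+({NUMBER})", quoted)

# Any count: collect every number wherever it appears.
any_count = re.findall(NUMBER, unquoted)
print(four.groups(), any_count)
```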

Max Patch


Peter Nyboer

I, too, am not a regular expression virtuoso, so I keep a running example patch for personal reference.
I also find it somewhat more convenient to use javascript for some of these duties, as it can be a bit tidier and easier.

anyway, here's my regexp examples patch, maybe someone else will find it useful:

Max Patch


Luke Hall

Oops, here's a version that should understand negative numbers as well. I hope it helps.

Max Patch


dalinnen

just to mention it, i came across this a while ago and find it indispensable for dealing with regexp tasks.....

http://gskinner.com/RegExr/

there is a downloadable version under "desktop version".

if you need to work with text files, text wrangler has wonderfully functional regexp features within its find/replace functions.

and if you are working in max, you can also do regexp's and the like through jstrigger (or js). in case that is more friendly.

--dave linnenbank

Tj Shredder

fairesigneaumachiniste

That's fantastic. I'll have to work through the 'endstream' subpatch as I'm not sure how it's working but that solves my problem.

Thank you.

Max Patch


Quote: thereishopeforus@hotmail.com wrote on Tue, 02 December 2008 00:02
----------------------------------------------------
> Oops, here's a version that should understand negative numbers as well. I hope it helps.
>
> lh
----------------------------------------------------

fairesigneaumachiniste

Thanks, I just worked out endstream. It's basically thresh but for atoms rather than just numbers. Very clever.

Thanks.

Luke Hall

Yeah I'm sorry I should have explained it better for you. I'm still working out how the condensed [regexp] thing works and how to properly implement it. I'm not entirely sure why you have to put the (?: )? around the first part of the regex that you don't want repeated, I'm sure that's just me overlooking something fairly obvious though.

Here's the help file I made for endstream. I'm about to make it available for Max5 in a collection of abstractions that I've used since starting with Max. If you want to cast your eyes over them then let me know, it would be good for me to have someone else sift through them. I'm sure there's a few things that are unnecessary or could be simplified but I haven't noticed yet.

Max Patch


fairesigneaumachiniste

I'll send you a mail off list and I'll have a look through your abstractions if you like.

endstream is really useful and I already added it to my object list.

Max Patch


Quote: thereishopeforus@hotmail.com wrote on Wed, 03 December 2008 14:57
----------------------------------------------------
> Yeah I'm sorry I should have explained it better for you. I'm still working out how the condensed [regexp] thing works and how to properly implement it. I'm not entirely sure why you have to put the (?: )? around the first part of the regex that you don't want repeated, I'm sure that's just me overlooking something fairly obvious though.
>
> Here's the help file I made for endstream. I'm about to make it available for Max5 in a collection of abstractions that I've used since starting with Max. If you want to cast your eyes over them then let me know, it would be good for me to have someone else sift through them. I'm sure there's a few things that are unnecessary or could be simplified but I haven't noticed yet.
>
> lh
----------------------------------------------------

Tj Shredder

Luke Hall

Pretty much yeah it is. But a delay that can auto set its delay time to be just as long as you need.

Try sending a stream of thousands of lists (or a similarly ridiculous amount of data) and watch delay choke unless its time is set high enough. And if you set the time really high in anticipation of a large stream then you sit around waiting when a short stream is sent. With endstream it sends out a bang just after the last piece of information is sent no matter how long this actually takes.

At least I hope it does all that, if I've made a stupid mistake feel free to point it out in more detail. Constructive criticism please =)

And to fairesigneaumachiniste, drop me an email and I'll send you what I've got. Some I'm pretty happy with and some are pretty much useless (conversions using [expr] so I don't have to do as much maths etc). Having someone else look at them always helps. I was using an abstraction I made for ages before I realised [listfunnel] already existed as an object!

Emmanuel Jourdan

Tj Shredder

Stefan Tiedje wrote:
> But I came across another problem.

Another one I don't know why it fails:

Why does it put out the number before the match I want? And why is it
creating a symbol? I tried to include a beginning or whitespace, but
that result is even weirder what should a "" 1/4 mean. That doesn't make
sense for me at all...

I want integer/integer or float/integer for example 3/4 or 0.75/8 no
matter where they are in a list. Should be simple, but it seems I give
regexp difficult stuff to digest...

Stefan

Max Patch


--
Stefan Tiedje------------x-------
--_____-----------|--------------
--(_|_ ----|-----|-----()-------
-- _|_)----|-----()--------------
----------()--------www.ccmix.com

Luke Hall

This regex will find int/int or float/int in any string:

[regexp (\d+\.?\d*/\d+)]

It looks for a digit one or more times.
Then a decimal point that occurs once or not at all.
Then a digit zero or more times.
Then a divide sign.
Then a digit one or more times.

The important bit is that the decimal point and the digits between it and the divide sign can occur zero times. The string will match even if they aren't there.

In your first attempt the (^|\s) means "match either the start of the string (^) or a whitespace character". When your string contains whitespace before the fraction you're trying to match, the regex remembers it as the first backreference and outputs it as the first element of the list, the second being whatever fraction matched.

Another problem you have is using an unescaped decimal point. In a regex a period means "match any character" so you have to escape it \. to make the regex match the literal character. This is why you have "3 4/5" returned. The period matches the space. I hope this has cleared up the confusion.
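
All three points can be reproduced in a few lines of Python. This is a sketch with invented input, not Stefan's patch; the pattern with the unescaped dot is a guess at the kind of regex that produces the "3 4/5" result.

```python
import re

text = "tempo 3/4 swing 0.75/8"   # made-up list contents

# int/int or float/int anywhere in the string.
fractions = re.findall(r"\d+\.?\d*/\d+", text)

# Unescaped dot: "." matches any character, so it can swallow the space.
loose = re.findall(r"\d+.?\d*/\d+", "3 4/5")
strict = re.findall(r"\d+\.?\d*/\d+", "3 4/5")

# (^|\s) is a capturing group, so the whitespace it matched is reported too.
with_prefix = re.search(r"(^|\s)(\d+/\d+)", "a 1/4").groups()
print(fractions, loose, strict, with_prefix)
```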

Tj Shredder

Luke Hall

Have a look in my reply above. The decimal point is a special character that matches anything, including a space. This is why the "3 4/5" string is passed. To use it as a literal character you must input it into the regex as \.

Luke Hall

Oh tell me about it, good for problem solving, better than sudokus in my opinion! I'm wondering if your issue with the regex returning a match and the "no matches" string at the same time has anything to do with you placing one set of capturing parentheses inside another. I'll look into it a bit deeper.

Tj Shredder

Adam Murray

Quote: stefantiedje wrote on Sun, 07 December 2008 23:37
----------------------------------------------------
>
> I wondered about nested parentheses anyway, how
> are they counted...?
>

Every left parenthesis starts a new backreference:

Max Patch

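
The counting rule is easy to check in Python, where groups follow the same numbering by opening parenthesis. A minimal sketch with a made-up pattern:

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# 1 = ((a)(b(c))), 2 = (a), 3 = (b(c)), 4 = (c)
m = re.match(r"((a)(b(c)))", "abc")
groups = m.groups()
print(groups)
```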

Tj Shredder

Luke Hall

Jitdoku! I think you might be on to a winner there Emmanuel, if I were you I'd act quickly before someone steals your idea!

And well done Stefan I knew you'd see the light eventually. Perhaps you should compile a patch of [regexp] examples, like the one Peter posted in this thread previously. I know I've got quite a few loitering around.

Emmanuel Jourdan

hekeus

Hi, so I'm parsing an osc style message with regexp and I'm using backreferences.

I've generally been having stability and memory issues in my pretty massive complicated patches, and would like to know if it might be the backreferences in my regexp objects which might be the issue.. looking earlier in the thread:

---
( ) creates a backreference (it will memorize that portion of
text); it's also a way of grouping elements so you can apply quantifiers
(?, *, +, {}). The problem with those is that they consume memory
because the engine has to memorize the whole thing
---

Max Patch


I wonder if the way I'm parsing is causing memory issues..

I receive hundreds of these messages a second! Is regexp the source of my troubles??

Thank you!
H

ShelLuser

I know this is an old thread but I'd rather dig up old (relevant) threads than starting copies.

Yesterday I finished my 4 part regexp tutorial on my blog. You can find an overview of all my tutorial posts here (link); this overview includes the 4 regexp parts.

Now, I know this is a little bit "spamming" but I got very positive results from many people while writing this tutorial so I figured that if others had regexps problems they might benefit from these posts too. And that's what got me to this post.

Hope this can help you too!