Substrings/substitutions/regular expressions in C

fairesigneaumachiniste

Standard C does not come with a regular expressions library (as far as I know), so what would be the best way to implement the following expressions, which I currently use in the regexp object?

[regexp -M[\w].+(?i:ms)] - substrings

[regexp (-M|ms) @substitute " "] - substitutions

There may be a simpler way. I want to search a list of various atoms (I have modified simplelist to do this), then match elements using regular expressions and output them in a different form.

Using the two regexp objects, I can convert -M34.6ms to 34.6.

Perhaps this is possible using 'if then' but I can't think how to split the symbols up without regular expressions.

I hope this is clear.

Thanks!

fairesigneaumachiniste

Here's a patch in Max which demonstrates roughly what I'm trying to do:

Max Patch

Copy patch and select New From Clipboard in Max.

I want to write this in an object as I have lots of different atoms to match of varying formats which would be criminally inefficient in Max (and I want to practice my dev skills! ;)

Joshua Kit Clayton

On Nov 25, 2008, at 3:56 AM, fairesigneaumachiniste wrote:

> As standard C does not come with a regular expressions library (as
> far as I know) so what would be the best way to implement these
> expressions I use in the regexp object:

I could suggest you use the PCRE library (http://www.pcre.org), which
is what the regexp Max object uses, but if the Max regexp object is
slow, then you will not see any performance improvement in your own
object which uses PCRE. So what would I suggest? A tight string-walking
function of your own. sscanf/strtok/etc. might also be useful if
you have a decent expectation of what the input is like.
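
As a rough illustration of the "decent expectation of the input" case, here is a minimal sketch that uses strtod instead of a regular expression. It assumes every atom looks exactly like -M<number>ms (as in the -M34.6ms example above); the function name parse_m_ms is hypothetical and not part of any Max API.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses strings shaped like "-M34.6ms".
   Returns 1 and stores the number in *out on success, 0 otherwise. */
int parse_m_ms(const char *src, double *out)
{
    char *end;
    double value;

    assert(src != NULL && out != NULL);

    /* require the literal "-M" prefix */
    if (strncmp(src, "-M", 2) != 0)
        return 0;
    src += 2;

    /* the number must start with a digit or a period; this rejects
       the signs, whitespace, "inf" and "nan" that strtod would accept */
    if (!isdigit((unsigned char)src[0]) && src[0] != '.')
        return 0;

    /* strtod converts the longest valid numeric prefix and sets
       end to the first character it did not consume */
    value = strtod(src, &end);
    if (end == src)
        return 0;

    /* require an "ms" suffix in either case, then end of string */
    if ((end[0] != 'm' && end[0] != 'M') ||
        (end[1] != 's' && end[1] != 'S') ||
        end[2] != '\0')
        return 0;

    *out = value;
    return 1;
}
```

With this, parse_m_ms("-M34.6ms", &v) returns 1 and leaves roughly 34.6 in v. Note that strtod still accepts exponents and hex forms and follows the current locale's decimal point, so a stricter format would need extra checks.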

From your simple example you can probably do something along the
lines of the following quick code, typed straight into my email client:

void stripnumber(char *dst, const char *src)
{
    char c;

    // keep numbers, spaces, and periods
    // strip everything else
    // dst must have room for strlen(src) + 1 bytes
    while ((c = *src++) != '\0') {
        switch (c) {
        case ' ':
        case '.':
        case '0':
        case '1':
        case '2':
        case '3':
        case '4':
        case '5':
        case '6':
        case '7':
        case '8':
        case '9':
            *dst++ = c; // copy char from input to output
            break;
        default:
            break; // skip
        }
    }

    *dst = '\0'; // null terminate output
}

Here's a simple tutorial on working with strings in C:
http://www.eskimo.com/~scs/cclass/notes/sx8.html

sscanf, strtok, + other string tutorials:
http://crasseux.com/books/ctutorial/sscanf.html
http://www.gnu.org/software/libtool/manual/libc/Finding-Tokens-in-a-String.html
http://www.gnu.org/software/libtool/manual/libc/String-and-Array-Utilities.html

You can also get into finite automata implementations of regular
expressions, which can be much faster than the backtracking
algorithms that Perl and PCRE use. Here's one reasonably clear paper
on the subject with code samples if you're feeling really nerdy.

http://swtch.com/~rsc/regexp/regexp1.html
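
To give a feel for the automaton idea on a small scale, here is a minimal sketch of a hand-coded state machine. It is not the construction from the paper (which builds automata from arbitrary expressions); it only recognizes one simplified pattern, -M followed by digits/periods and then ms, and the names step and matches_m_ms are hypothetical. Each input character is examined exactly once, with no backtracking.

```c
#include <assert.h>
#include <stddef.h>

/* One state per position in the simplified pattern -M[0-9.]+ms */
enum state {
    S_START, S_DASH, S_PREFIX, S_NUMBER, S_SUFFIX_M, S_ACCEPT, S_FAIL
};

/* Transition function: current state + one character -> next state */
static enum state step(enum state s, char c)
{
    int is_num = (c >= '0' && c <= '9') || c == '.';

    switch (s) {
    case S_START:    return c == '-' ? S_DASH : S_FAIL;
    case S_DASH:     return c == 'M' ? S_PREFIX : S_FAIL;
    case S_PREFIX:   return is_num ? S_NUMBER : S_FAIL;
    case S_NUMBER:
        if (is_num)
            return S_NUMBER;
        return (c == 'm' || c == 'M') ? S_SUFFIX_M : S_FAIL;
    case S_SUFFIX_M: return (c == 's' || c == 'S') ? S_ACCEPT : S_FAIL;
    default:         return S_FAIL; /* anything after "ms" is a mismatch */
    }
}

/* Returns 1 if the whole string matches the pattern, 0 otherwise */
int matches_m_ms(const char *src)
{
    enum state s = S_START;

    assert(src != NULL);

    while (*src != '\0' && s != S_FAIL)
        s = step(s, *src++);

    return s == S_ACCEPT;
}
```

The running time is linear in the length of the input, which is the property that makes the automaton approach attractive for large lists of atoms.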

If you get deeper into string processing in C, you'll also need to pay
attention to the UTF-8 Unicode representation if you want to
handle non-ASCII characters:

http://en.wikipedia.org/wiki/UTF-8

Hope this gets you started. If you have further questions about this
stuff, I'd suggest you search online. Obviously lots of info out there.

-Joshua

grimepoch

I'll also add that on OS X there is a regex library installed. I believe it is the POSIX implementation; see regex.h.

That said, from what I have read, I would be careful as to how you use it because it can be slow. I have not witnessed this yet in my implementation of it in my external.

The functions of interest are:

regcomp
regexec

You can search for substrings with () very easily using an array of offsets to the matches. I then use the standard string functions to build up what I want.
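
For reference, a minimal sketch of that approach with the POSIX calls, assuming a system that provides regex.h. The pattern is a simplified stand-in for the original Max expression, and extract_number is a hypothetical helper name, not code from my external.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the number out of "-M34.6ms" into dst.
   Returns 1 on a match, 0 otherwise (including when dst is too small). */
int extract_number(const char *src, char *dst, size_t dstsize)
{
    regex_t re;
    regmatch_t m[2]; /* m[0] = whole match, m[1] = first () group */
    int ok = 0;

    assert(src != NULL && dst != NULL);

    /* POSIX extended syntax; [mM][sS] makes only the suffix
       case-insensitive */
    if (regcomp(&re, "^-M([0-9.]+)[mM][sS]$", REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, src, 2, m, 0) == 0) {
        /* rm_so and rm_eo are byte offsets into src */
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (len < dstsize) {
            memcpy(dst, src + m[1].rm_so, len);
            dst[len] = '\0';
            ok = 1;
        }
    }

    regfree(&re);
    return ok;
}
```

Calling extract_number("-M34.6ms", buf, sizeof buf) leaves "34.6" in buf. In a real external you would probably call regcomp once when the object is created and reuse the compiled expression, since compiling on every call is likely where much of the cost goes.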

However, if efficiency is what you want, I'd follow the advice above and write very specific functions with the C string functions, which are ULTIMATELY going to be faster.