A little jit.str.regexp help for a noob

    Sep 30 2013 | 9:09 pm
    I'm in way over my head here, but perhaps someone can steer me in the right direction. Bear with me, I have no computer science background whatsoever! Nonetheless, trying to get my head around the following:
    In this patch you'll see I'm trying to use jit.str.regexp to match some text in a website. I'm trying to match text in between two things, so I'm using "(.*)"
    If you try Example 2 on the right, it works fine. I can get "National Geographic All Rights Reserved" by using "/National(.*?)Reserved/"
    However, if you try Example 1 on the left, the same approach does not find the match "Solar System Live". In fact, I can't generate ANY matches at all from that url, even just something like a single word, like "Solar" by using "/Solar/"
    What could it be about Example 1 that is prohibiting jit.str.regexp from working as it does in Example 2?
    Any suggestions greatly appreciated--many thanks.

    • Oct 01 2013 | 1:06 pm
      Whatever I try, the Solar System page gives a PCRE error -10. That means something in that web page is inherently incompatible with the regex engine... which is very strange, and should theoretically be solvable; we just need to find out what exactly.
    • Oct 01 2013 | 1:24 pm
      So, error code -10 means one of:
      #define PCRE_ERROR_BADUTF8 (-10) /* Same for 8/16/32 */
      #define PCRE_ERROR_BADUTF16 (-10) /* Same for 8/16/32 */
      #define PCRE_ERROR_BADUTF32 (-10) /* Same for 8/16/32 */
      (from PCRE source code header : http://vcs.pcre.org/viewvc/code/trunk/pcre.h.in?view=markup)
      So "bad UTF-8" (or UTF-16, or UTF-32): a character encoding problem (https://fr.wikipedia.org/wiki/UTF-32).
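      To illustrate what "bad UTF-8" can mean in general, here is a small Python sketch (outside Max, and only an assumption about what is happening in the page): a non-ASCII character stored as ISO-8859-1 is a single byte that is not a valid UTF-8 sequence, so a UTF-8-aware matcher may reject it. The sample word and the helper name is_valid_utf8 are hypothetical.

```python
# Illustration only: why ISO-8859-1 bytes can be rejected as "bad UTF-8".
# The sample text is made up; it is not taken from the page in question.
latin1_bytes = "café".encode("iso-8859-1")  # "é" becomes the single byte 0xE9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Decode with the encoding the bytes were really written in,
# then re-encode as UTF-8 so a UTF-8-aware tool can read them.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

      If the page does contain bytes like this, converting it to UTF-8 before matching would be another possible workaround, though I have not checked that against this particular page.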
      Sooo if you take that text file and remove the encoding="iso-8859-1" part at the beginning, jit.str.regexp DOES work as expected... now, you have two choices:
      * either remove that first line programmatically each time,
      * or work out why jit.str.regexp interprets it as a character encoding command, which is potentially tricky, and possibly undocumented in Max or independent of Max altogether. From there, you could give jit.str.regexp the right options so that it doesn't behave like this.
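      The first option could look roughly like this in Python. This is only a sketch outside Max: it assumes the encoding="iso-8859-1" part sits in an XML declaration on the first line, the sample text is made up, and strip_encoding_declaration is a hypothetical helper. It also uses Python's re module, not jit.str.regexp, so it shows the idea rather than the Max patch.

```python
import re


def strip_encoding_declaration(text: str) -> str:
    """Remove the encoding="..." attribute from a leading XML declaration, if present."""
    return re.sub(
        r'^(\s*<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2',
        r'\1',
        text,
        count=1,
    )


# Hypothetical stand-in for the downloaded page (assumption, not the real file).
sample = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    '<html><title>Solar System Live</title></html>'
)

cleaned = strip_encoding_declaration(sample)

# Same kind of lazy "text in between" match as in the original question.
match = re.search(r'Solar(.*?)Live', cleaned)
print(cleaned.splitlines()[0])
print(match.group(0) if match else "no match")
```

      Text without such a declaration passes through unchanged, so the same step could be applied to every page before matching.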
    • Oct 01 2013 | 2:07 pm
      Thanks for the replies! I think for me, being so out of my element with this, the best bet is to try and learn how to do your first suggestion, to remove that first line programmatically.
      Or, find a different website to use for my little project!