A little jit.str.regexp help for a noob
I’m in way over my head here, but perhaps someone can steer me in the right direction. Bear with me, I have no computer science background whatsoever! Nonetheless, trying to get my head around the following:
In this patch you’ll see I’m trying to use jit.str.regexp to match some text in a website. I’m trying to match text in between two things, so I’m using "(.*)"
If you try Example 2 on the right, it works fine. I can get "National Geographic All Rights Reserved" by using "/National(.*?)Reserved/"
However, if you try Example 1 on the left, the same approach does not find the match "Solar System Live". In fact, I can’t generate ANY matches at all from that url, even just something like a single word, like "Solar" by using "/Solar/"
What could it be about Example 1 that is prohibiting jit.str.regexp from working as it does in Example 2?
Any suggestions greatly appreciated–many thanks.
----------begin_max5_patcher---------- 1086.3oc6YEsaaaCE8Y6uBB8T5PprHojkTeYMXnXuzMLz8x.FJJnsXrYlDo. Ecr6J5+9DojhcRijnyTTrW2KxfTzhm6gWetmq7WlNwYgXGsvA7FveBlL4KSm LwLkdhI0im3jQ1sLkTXVlyRQVFkqbtr5dJ5NkY928GW8K+16eG.AtXqP9WL9 pW0rlqkqVnWimqW8L4D0x0kK4SR5RU0t6iwk2FDpu.ilq+.U9E.er9qvRLai XwMuFGd2SVvUbRF0bqqjLRZyc3axX7Tpx.Z39IEaTMy5cvCof82lGBD45om8 qSmpub4.QJPvEbgB7zHl387RX27x7SMdgS2VBrugVJCNJ6VJXAguB5zUn6YB an4idRJBNxf26QCdXKAe8zUqT84bZEBcb.eb74EXXrKJ3R.1fr9HF+uiHFDx PL1kwfO+IlBJOoWVAECc0jRsRBtSRActns9SoTh7nTRQQwZdHxP.A9cRCvSM ZXwFkRvs3T1KPGXXulqOV3ghOhf.1VlrNu6okMmQKJHqneyYZhXKOUPR.qUp 72La11sacuVrQlwRIKbWtd1xUrWufwm86hThDjQTR1Nm9pcV4r.EGT4snEJI 5HOwQufpAsweTIofZQA0JdwG1Ig3+c.gbWACqXD7YDizqdgePvA4BHutjCgQ ur5EsT86FlxcSZRpSusU.Qg6OeaMJiGDQezQd9VFEepQIafsGnInBkzURWQ2 kCdqjBl8qDESvIoW39C+3q9.sfJuklLqeJD4WUwDE2EGh7FDNL3eEGZ+nAms 0CtlkRsHkLzvmAgcRmCiOD7.RmOuU7EKoDtKuNGcEUrRRxWyV5VZCr5lyrnr eEC2b0uKQcX3YjndKId4RFWAdatHeSNnqNABflrMbPUcNOS6iHXK4dAOedfw OIOvskDUPUcYLNzSaLFWoegQFWxsjM7epJ7gPqKvi8NaKvGac8cT3Ya881x7 aJlC946TJAWklB9.a0ZUAno7d+sMhihL+5n5EmzpXYv4uXYimHkn3yYKDcka gqrHWKWF0YkZH9EI4ZTrLZ5113W78raoy58Wi0VEgQc9NVfv+2pXeVEis1oH z6D2onAzNoL9C++lLwsd96SiEhMxkMaSSDB1G6IzBEqxn3gKBduEslkjP4GZ CIgUPVjRMD2i6AwZ7n2Jbe3AOt3wuG7nM0MV3Aay4kFOvwAOg1fm3wiehr.O n9xmyXI4hR6+0+nxetuweak2Oeezci1uQCWDDeuCuVXznwiQiGBFc3vi9kgb JoPXvyIjBwCNJZAOdiGdrRwZDwi+IV9b3IF+DZg9CdDwiUJ5giGdrQOT+O9N R3wZCAiDb7Fhz4GT.Nx6f5uAQOmkew1vmnwq7B1F4pQDNA1PO9OQ7T0NCIO+ Vprn9YZfRY+e2Hj5gyuzLjwqFZZ3xQRuk0rdSyVNDYYaepxd91Hq5YaWzbmo 584qS+G.HpRil -----------end_max5_patcher-----------
Whatever I try, the Solar System page gives a PCRE error -10. That means something in that web page is incompatible with the regex engine, which is very strange. It should be solvable in principle; we just need to find out what exactly.
so, error code -10 means either
#define PCRE_ERROR_BADUTF8 (-10) /* Same for 8/16/32 */
#define PCRE_ERROR_BADUTF16 (-10) /* Same for 8/16/32 */
#define PCRE_ERROR_BADUTF32 (-10) /* Same for 8/16/32 */
(from PCRE source code header : http://vcs.pcre.org/viewvc/code/trunk/pcre.h.in?view=markup)
so "bad utf 8" or utf 16 or utf 32, a character encoding problem (https://fr.wikipedia.org/wiki/UTF-32).
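To make "bad UTF-8" concrete, here is a minimal Python sketch (not Max code) of what that error usually points at: bytes written in ISO-8859-1 that contain an accented character are not a valid UTF-8 sequence. The sample string and both helper names are made up for illustration, and it is only an assumption that this is what is tripping PCRE on the Solar System page.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical sample: in ISO-8859-1 the "è" is the single byte 0xE8,
# which UTF-8 reads as the start of a multi-byte sequence that never completes.
sample = "Système solaire".encode("iso-8859-1")

print(is_valid_utf8(sample))                  # raw ISO-8859-1 bytes
print(is_valid_utf8(latin1_to_utf8(sample)))  # same text, re-encoded as UTF-8
```

Pure ASCII text is identical in both encodings, so a page can look fine until a single accented or special character shows up.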
Sooo if you take that textfile and remove the encoding="iso-8859-1" thing at the beginning, jit.str.regexp DOES work as expected… now, you have two choices:
*either remove that first line programmatically each time
*or understand why jit.str.regexp interprets it as a character encoding command, which is potentially tricky, possibly not documented in Max, or not dependent on Max at all. And, from there, give jit.str.regexp the right commands so that it won’t behave like this.
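For the first option, the logic could be sketched like this in Python (again, not Max code; inside Max you would do the equivalent wherever you preprocess the downloaded text). The function name, the pattern, and the sample page below are hypothetical stand-ins, since the real page source is not shown here. It strips only the encoding="..." attribute from the first line rather than deleting the whole line, on the assumption that the rest of that line may carry content you still want.

```python
import re

# Matches an attribute like  encoding="iso-8859-1"  (single or double quotes).
ENCODING_ATTR = re.compile(r'\s*encoding\s*=\s*(["\'])[^"\']*\1', re.IGNORECASE)


def strip_encoding_declaration(text: str) -> str:
    """Remove an encoding="..." attribute from the first line only."""
    first, sep, rest = text.partition("\n")
    return ENCODING_ATTR.sub("", first, count=1) + sep + rest


# Hypothetical stand-in for the downloaded page.
page = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    "<html><title>Solar System Live</title></html>"
)

cleaned = strip_encoding_declaration(page)
match = re.search(r"Solar(.*?)Live", cleaned)
print(cleaned.splitlines()[0])
print(match.group(0) if match else "no match")
```

Python's own regex engine would match either way, so this only illustrates the clean-up step, not the PCRE error itself.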
Thanks for the replies! I think for me, being so out of my element with this, the best bet is to try and learn how to do your first suggestion, to remove that first line programmatically.
Or, find a different website to use for my little project!