Web scraping google images

    Apr 24 2009 | 12:27 am
    Hi everyone.
    I want to do a project for school that uses jit.uldl and jit.str.regexp to scrape images from Google Images. I've been using the parser/downloader patch found in the jit.str.regexp help file as an example, but have had little luck. From my understanding, this patch (unedited) is supposed to download the HTML source from the c74 homepage, send the info as a matrix to jit.str.regexp, which parses the HTML and extracts the gif and jpg files, reconfigures them back into web addresses, and then sends those back to jit.uldl to be downloaded. Is that right? Anyway, it doesn't appear to be doing that, or I don't know where the images are being downloaded to. As a note, I sometimes get the error jit.str.regexp: PCRE error -10 in the Max window.
    Could anybody help me understand jit.str.regexp better, and explain what exactly is going on in the jit.str.regexp parser/downloader example? Does anybody have any good examples of web scraping that could help?
    thank you so much!
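    For anyone trying to follow the patch's logic outside Max, the same download/parse/rebuild/download flow can be sketched in Python. This is a rough sketch over a made-up HTML snippet and a hypothetical base URL, not the real c74 page source:

```python
import re

# A tiny stand-in for the HTML that jit.uldl would download
# (the real patch fetches the cycling74.com front page).
html = """
<html><body>
<img src="/images/logo.gif">
<img src="http://example.com/pics/photo.jpg">
</body></html>
"""

BASE = "http://example.com"  # hypothetical base URL for relative paths

# Like jit.str.regexp: find every .gif/.jpg reference in the source.
matches = re.findall(r'src="([^"]+\.(?:gif|jpg))"', html)

# Reconfigure relative paths back into absolute web addresses, which
# the patch then hands back to jit.uldl to actually download.
urls = [m if m.startswith("http") else BASE + m for m in matches]
print(urls)
```

    The second jit.uldl stage in the help patch corresponds to fetching each entry of that final list.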

    • Apr 24 2009 | 12:42 am
      NOTE: I am getting the error jit.str.regexp: PCRE error -10 EVERY time I try to run the parser/downloader patch. Does anyone know what this means? It might solve my problem.
    • Apr 27 2009 | 10:30 pm
      I had this problem too. Here's the solution: you have to use a trigger object, [t b wclose open], connected to jit.textfile. Apparently, if you don't open the jit.textfile window, it cannot work.
    • Apr 28 2009 | 1:38 am
      Works perfectly! Thank you!
    • May 03 2009 | 9:17 am
      This queries Google Images and downloads all the thumbnails. Please let me know if you do anything interesting; I'd like to check it out.
    • Jul 22 2010 | 4:15 am
      I am new to Max; my professor suggested I use it for a project I am working on. I want to enable a user to input a word, for example "dog", and have Max search for that word in Google Image Search and return an image.
    • Jul 22 2010 | 11:20 am
      That is definitely possible in Max. As it is school work I won't tell you how to do it, but work through the Max tutorials to get to grips with the basics. Then take a look at the example patch in the [jit.str.regexp] help patch, called [parser/downloader]. The main objects are [jit.uldl] for downloading the page source code and [jit.textfile] for storing it, then [jit.str.regexp] for searching through the code to find the particular URL you want. Regular expressions can be a bit of a pain, but there is lots of information on the forum already. One tip to get you started: the URL for a Google image search for "dog" looks something like this: "http://www.google.com/images?q=dog".
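      The query URL from that tip can be built mechanically. A Python sketch, assuming the URL format as it was at the time of this thread (Google has changed it since):

```python
from urllib.parse import urlencode

def google_images_url(term):
    # Build the search URL the way the tip above describes:
    # http://www.google.com/images?q=dog
    return "http://www.google.com/images?" + urlencode({"q": term})

print(google_images_url("dog"))
```

      In Max the same thing is usually done with [sprintf] feeding a message box, but the principle is identical: only the q parameter changes per search.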
    • Aug 01 2010 | 1:58 pm
      Possible with Max. But Google has again changed the way it structures the URL/path, so I have to change my patch again. Maybe this can help:
      Look at the way the URL/path is structured. For example:
      &q=cat will search for the word cat ... and so on
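      That kind of URL inspection can be sketched in code. This is an illustration over a hypothetical query string, not Google's current format:

```python
from urllib.parse import parse_qs, urlencode

# Pick apart a hypothetical search query string to see how it is
# structured: each &key=value pair is one parameter.
query = "hl=en-EN&q=cat"
params = {k: v[0] for k, v in parse_qs(query).items()}
print(params)  # the q parameter carries the search word

# Rebuild the query string with a different search word.
params["q"] = "dog"
print(urlencode(params))
```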
    • Oct 18 2010 | 10:00 am
      Hello everyone,
      By tweaking around with [jit.str.regexp] I was able to download all the thumbnails from Google for a given word query. Now, how can I retrieve the freshly-downloaded images back into Max and show them?
      Thank you so much,
    • Oct 19 2010 | 10:28 am
      Hey everyone, how have you been?
      I was able to get this to work the way I wanted, but now I can't seem to get rid of the PCRE error -10 message, which won't allow me to download images. I tried the [trigger] solution mentioned above, but even though the [jit.textfile] window opens, it still reports the error.
      Can somebody help, please? Thanks!
    • Nov 13 2010 | 11:51 am
      Hello there friends,
      After some serious dabbling around, I am still faced with the PCRE-10 problem.
      I made a patch that tries to replicate the problem. Please note that the trigger solution mentioned earlier in this thread is present, but no bangs are connected to it yet, so you can try both ways and, hopefully, help me get this done... :) It is really weird that everything works fine with some queries sent to Google by this patch, downloading all the files needed, while with other tags it doesn't even try...
      Honest thanks, coming from honest work :)
    • Nov 13 2010 | 1:56 pm
      I didn't experience the error when using your patch, but it didn't seem to be finding the images either, probably due to the change in the way images are displayed in the search results. It seems Google now uses JavaScript to reload images and canvas to display them. Here's a tiny patch which will print the URLs of the first 20 images for a given search term. It's up to you to patch that back into [jit.uldl] to download them to your computer.
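      The "first 20 URLs" step can be sketched outside Max like this. The HTML here is a made-up stand-in; the real result markup has changed repeatedly, but at the time of this thread each result carried the image address in an imgurl= query parameter:

```python
import re

# Stand-in for the downloaded results page: many results, each with
# its image address buried in an imgurl= parameter.
html = "&".join(f"imgurl=http://example.com/img{i}.jpg&x=1"
                for i in range(50))

# Pull out every imgurl value, then keep only the first 20,
# mirroring what the patch prints to the Max window.
urls = re.findall(r"imgurl=([^&]+)&", html)[:20]
print(len(urls), urls[0])
```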
    • Nov 13 2010 | 8:56 pm
      Dear Luke, thank you so much! But I think your post is missing the patch :) I would be really grateful to check out that new way of doing this you mentioned! Thanks, honestly. Thanks!
    • Nov 16 2010 | 9:40 pm
      Here's my patch again just in case it helps you out still:
    • Nov 20 2010 | 12:57 am
      Thank you, dear Luke! As I had the chance to tell you, I got it working on my own, which was a boost to my confidence, but, ahah, opening your patch made it look so easy that I ended up using a mix of both... Thanks! I'll be sure to show you what I came up with :)
      Hugs and many thanks!
    • Jan 16 2011 | 4:45 am
      Google comes back with results like 2,230,000 pictures of the Sun. Is it possible to create a patch that can download all of those? Or do you think Google would restrict something like that?
    • Jan 16 2011 | 1:48 pm
      Hi Eric, the only things holding you back are disk space, time and Max know-how. Google doesn't get a say in whether or not you can download anything.
    • Jan 16 2011 | 6:22 pm
      I think you are limited to the first x images that are shown on the first page. For the "bottle" example patch from Luke there are about 72,100,000 results, but when you start the download it only downloads the first x images shown on the first page.
      I have only done a quick test, so it may be possible to download them all. I hope you have a good provider and a good internet connection to download 2,230,000 images of the Sun at once. I will go for the moon. :)
    • Feb 10 2011 | 2:10 am
      Luke, your patch is great. Were you able to figure out how to display each image in Max? I'm trying to do something similar.
    • Feb 10 2011 | 12:23 pm
      You could look into using [jweb] to display them. Or you could route the URLs back into [jit.uldl] to download the files and then "read" them into [fpic], or even use Jitter. There's an example of how to download the files to your local drive in the "parser/downloader" subpatch in the [jit.str.regexp] help file.
    • Aug 24 2011 | 5:18 am
      Hi guys, I'm doing work for the University of La Plata and this post helped me a lot, but I still don't understand much of the programming here.
      The first patch only allows me to download pictures of bottles, and I could not change that. Have you been able to fix this problem? It is very important for my work.
      Excuse my English; I am Argentinian and my language is Spanish. I await your responses.
      Thank you very much!
    • Aug 24 2011 | 11:29 pm
      What sort of images are you looking for? You can change the search URL and have the patch find other images. Here's the search string: http://www.google.com/images?hl=en-EN&q=bottle
      Change the last word to modify the search query.
    • Nov 01 2011 | 1:01 pm
      Right... I need to do this kind of thing and display images from a Google search in a Jitter window, so I've been messing with the patch above, with some success... although, knowing my programming skills, it's very ungraceful.
      What I WANT this to do is get the list of image URLs from Google and then display a random one from that list every time it's banged...
      If anyone could point me in the right direction, or tidy up this patch and make it do that (and explain what you did), then you'll be on Santa's nice list and gain massive karma :)
    • Nov 01 2011 | 4:23 pm
      Basically... if I could get a list out of [jit.str.regexp] of all the URLs, which I could then unpack into messages... I would be happy.
    • Nov 01 2011 | 10:10 pm
      OK... so after some more thinking about this, I THINK the solution is to use a [coll] object.
      However, the way I've patched it up, the symbols output from [jit.str.tosymbol] don't seem to be going through the [pack] object into the [coll].
      Any ideas why? Something I'm missing?
    • Nov 01 2011 | 10:46 pm
      Here is a method to increment the index by 1 using [value].
    • Nov 01 2011 | 10:50 pm
      tighter patch
    • Nov 01 2011 | 10:53 pm
      I would love to see the correct regexp for harvesting the subject line from a Google search results page.
    • Nov 02 2011 | 10:47 am
      Thanks so much, that has been a great help.
    • Nov 02 2011 | 11:42 am
      It's not the most efficient expression, but something like the following should work. I've also got a patch that will use the image query from the previous example to grab the referring page (not just the image URL) and then search for the page titles using another [jit.uldl] page request.
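      The title-harvesting idea can be sketched with a regexp like the following. This is a Python illustration over a made-up page, not the exact expression used in the patch:

```python
import re

# Hypothetical referring page, fetched with a second page request
# the way the second [jit.uldl] pass in the patch would do.
page = ("<html><head><title>Bottles - Example Gallery</title></head>"
        "<body>...</body></html>")

# Non-greedy match between the title tags, so it stops at the
# first closing tag rather than swallowing the rest of the page.
m = re.search(r"<title>(.*?)</title>", page)
title = m.group(1) if m else ""
print(title)
```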
    • Nov 04 2011 | 9:02 am
      Thanks, Luke, you regexp jedi!
    • Dec 02 2011 | 5:16 pm
      I would really like to find a way to grab images from a particular tumblr page and use in my interactive. Would jit.str.regexp be the way for me to do this?
    • Dec 03 2011 | 1:01 pm
      You'd need to download the source with [jit.uldl], store it for processing with [jit.textfile] and then search for the URLs with [jit.str.regexp] and then send this to another [jit.uldl] to download the files to your machine. See how far you get and if you get stuck you can always ask here and post your patch so far.
    • Dec 03 2011 | 1:21 pm
      Grabbing the source will not help you much, as Tumblr splits the content into pages.
      http://boutofcontext.com/tumblr_backup.php Start with this for getting a good source.
      Pass the results to a mass picture grabber such as those found in the Firefox extension gallery, or use the jpg scanner/downloader found in the jit.uldl help file, or the one Luke provided here a few patches ago.
    • Dec 03 2011 | 11:47 pm
      Thanks guys really helpful points and a good first base for me to try out what I am thinking. I will see how I get on and may need to post back if I get stuck. Thanks again
    • May 01 2014 | 3:04 am
      I was just checking the jit.str.regexp help file example for downloading and parsing web pages. It is not working as-is (the page downloads but the images don't). I checked out the code posted in this thread and spent some time dissecting it, but was not able to get it working. Does anyone have any suggestions or updates to this thread?
    • May 01 2014 | 7:09 pm
      My patches were working with Google, but not anymore. This patch still works with the Bing search engine:
      Something else: Copy Compressed in Max 6 gives -begin_max5_patcher- in the pasted code?
    • Jul 20 2014 | 11:09 pm
      Hey bitter, nice patch!
      Do you have any idea how I can download all the pictures?
      Right now it stops at 34. With my Google picture scraper (which also stopped working, PCRE error -10) I could go from page to page to scrape them all.
      With the one-window view in Bing I can't find a workaround.
    • Aug 03 2014 | 1:26 pm
      Thanks. Two weeks without internet and a few days between 1600 and 2000 meters above sea level. :) I never needed to download all the pictures. Don't you think downloading all the images would be too much?
      Later on I will see what I can do. Have you found a solution?
    • Feb 04 2016 | 10:59 am
      How can you figure out "imgurl=([^&]+)&" from the text after downloading the page with jit.uldl? I think one would need to know all the text before applying a regular expression, to know what is matched or not. But the text is too long.
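      One way to see why the length doesn't matter: you only need to recognize the local pattern around one example match, and the regexp engine scans the rest of the text for you. A Python sketch over a deliberately long, made-up page:

```python
import re

# A long stand-in for a downloaded results page: lots of irrelevant
# text with a few "imgurl=...&" fragments buried inside it.
text = ("x" * 10000 + "imgurl=http://example.com/a.jpg&"
        + "y" * 10000 + "imgurl=http://example.com/b.jpg&")

# imgurl= anchors each match; [^&]+ grabs everything up to the next
# "&", i.e. the image address itself. The surrounding 20,000
# characters are skipped automatically.
urls = re.findall(r"imgurl=([^&]+)&", text)
print(urls)
```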
    • Feb 16 2016 | 9:38 pm
      I don't understand what you are asking.
      The output text is long, but it is possible to read.
    • Apr 02 2016 | 6:37 pm
      The patch doesn't work anymore. It doesn't retrieve the list. Any ideas? Thanks, FC