Web scraping google images

Apr 24, 2009 at 12:27am

Hi everyone.

I want to do a project for school that uses jit.uldl and jit.str.regexp to scrape images from Google Images. I’ve been using the parser/downloader patch found in the jit.str.regexp help file as an example, but have had little luck. From my understanding, this patch (unedited) is supposed to download the HTML source from the c74 homepage and send it as a matrix to jit.str.regexp, which parses the HTML, extracts the gif and jpg file names, rebuilds them into web addresses and then sends those back to jit.uldl to be downloaded. Is that right? Anyway, it doesn’t appear to be doing that, or I don’t know where the images are being downloaded to. As a note, I sometimes get the error jit.str.regexp: PCRE error -10 in the Max window.

could anybody help me understand jit.str.regexp better, and what exactly is going on in the jit.str.regexp parser/downloader example? does anybody have any good examples of web scraping that could help?

thank you so much!
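For readers who want to see the same pipeline outside Max, here is a minimal Python sketch of the parsing step described above: find gif/jpg references in a page's HTML and rebuild them into full web addresses. It is not the help-file patch. The pattern, the function name `image_urls` and the sample HTML are assumptions made for illustration only.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or src='...' values ending in .gif, .jpg or .jpeg (case-insensitive).
IMG_SRC = re.compile(r"""src\s*=\s*["']([^"']+\.(?:gif|jpe?g))["']""", re.IGNORECASE)


def image_urls(html, base_url):
    """Return absolute URLs for every gif/jpg src found in html, in page order."""
    # urljoin resolves relative paths against the address the page was loaded from.
    return [urljoin(base_url, src) for src in IMG_SRC.findall(html)]


# Hypothetical page source standing in for the downloaded HTML.
SAMPLE_HTML = """
<img src="/images/logo.gif" alt="logo">
<img src='photos/cat.JPG'>
<img src="http://example.com/pics/dog.jpeg">
<script src="app.js"></script>
"""

if __name__ == "__main__":
    for url in image_urls(SAMPLE_HTML, "http://example.com/home/index.html"):
        print(url)
```

Each resulting address is what would then be handed to a downloader, which is the role jit.uldl plays in the patch.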

#43513
Apr 24, 2009 at 12:42am

NOTE: I am getting the error jit.str.regexp: PCRE error -10 EVERY time I try to run the parser/downloader patch. does anyone know what this means? It might solve my problem.

#156118
Apr 27, 2009 at 10:30pm

I had this problem too:
Here’s the solution: you have to use a trigger object, t b wclose open, connected to jit.textfile. Apparently it cannot work if you don’t open the jit.textfile window.

Ad.

#156119
Apr 28, 2009 at 1:38am

works perfect! thank you!

#156120
May 3, 2009 at 9:17am

this queries google images and downloads all the thumbnails. please let me know if you do anything interesting, i’d like to check it out.

#156121
Jul 22, 2010 at 4:15am

Hi,

I am new to Max; my professor suggested I use it for a project I am working on. I want to enable a user to input a word, for example “dog”, and have Max search for that word in Google Image Search and return an image.

#156122
Jul 22, 2010 at 11:20am

That is definitely possible in max. As it is school work I won’t tell you how to do it but work through the max tutorials to get to grips with the basics. Then take a look at the example patch in the [jit.str.regexp] help patch. It is called [parser/downloader]. The main objects for downloading the page source code are [jit.uldl] and [jit.textfile] for storing it. Then [jit.str.regexp] for iterating through the code to find the particular URL you want. Regular expressions can be a bit of a pain but there is lots of information on the forum already. One tip to get you started: the URL for a google image search of “dog” looks something like this: “http://www.google.com/images?q=dog”.

lh

#156123
Aug 1, 2010 at 1:58pm

Possible with Max. But Google has again changed the way it structures the URL/path, so I have to change my patch again. Maybe this can help:

Look at how the URL/path is structured.
For example:

http://images.google.be/images?hl=nl&gbv=2&biw=1436&bih=751&tbs=isch%3A1&sa=1&q=cat&btnG=Zoeken&aq=0&aqi=g10&aql=&oq=&gs_rfai=

&q=cat
so it will search for the word cat
… and so on
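To make the structure of that address easier to see, the following Python sketch splits the example URL into its query parameters. It only illustrates how the `name=value` pairs are laid out. The helper name `query_params` is an assumption, and nothing here says what Google does with parameters other than `q`.

```python
from urllib.parse import urlsplit, parse_qs

# The example address from the post above, split over two lines for readability.
SEARCH_URL = (
    "http://images.google.be/images?hl=nl&gbv=2&biw=1436&bih=751"
    "&tbs=isch%3A1&sa=1&q=cat&btnG=Zoeken&aq=0&aqi=g10&aql=&oq=&gs_rfai="
)


def query_params(url):
    """Return the query string of url as a dict of name -> first value."""
    # keep_blank_values=True keeps empty parameters such as "oq=" in the result.
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}


params = query_params(SEARCH_URL)

if __name__ == "__main__":
    for name, value in params.items():
        print(f"{name} = {value!r}")
```

The entry `params["q"]` holds the search word, which is the part a patch needs to swap out.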

#156124
Oct 18, 2010 at 10:00am

Hello everyone,

By tweaking around with [jit.str.regexp] I was able to download all the thumbnails from Google, with a certain word query. Now, how can I retrieve the freshly-downloaded images back in to Max and show them?

Thank you so much,

Cisne

#156125
Oct 19, 2010 at 10:28am

Hey everyone, how have you been?

I was able to get this to work the way I wanted, but now I can’t seem to get rid of the PCRE error -10 message, which won’t allow me to download images. I tried the [trigger] solution mentioned above, but even though the [jit.textfile] window opens, the error is still reported.

Can somebody help, please? Thanks!

Cisne

#156126
Nov 13, 2010 at 11:51am

Hello there friends,

After some serious dabbling around, I am still faced with the PCRE-10 problem.

I made up a patch that tries to replicate the problem. Please note that the trigger solution mentioned earlier in this thread is present, but no bangs are connected to it as of yet, so you can try both ways and, hopefully, help me get this done… :)
I mean, it is really weird that everything works fine with some “queries” sent to Google by this patch, downloading all the files needed, while with other “tags” it does not even try…

Honest thanks, coming from honest work :)

Cisne

#156127
Nov 13, 2010 at 1:56pm

I didn’t experience the error when using your patch but it didn’t seem to be finding the images either, probably due to the change in the way images are displayed in the search results. It seems it uses javascript to reload images and canvas to display them. Here’s a tiny patch which will print the URLs to the first 20 images for a given search term. It’s up to you to patch that back in to [jit.uldl] to download them to your computer.

#156128
Nov 13, 2010 at 8:56pm

Dear Luke, thank you so much! But I think your post missed the patch :) I would be really grateful to check out that new way of doing this you mentioned! Thanks, honestly. Thanks!

#156129
Nov 16, 2010 at 9:40pm

Here’s my patch again just in case it helps you out still:

– Pasted Max Patch, click to expand. –
#156130
Nov 20, 2010 at 12:57am

Thank you dear Luke! As I had the chance to tell you, I got it working on my own, which was a boost in my confidence, but, ahah, opening your patch made it look so easy, that I ended up using a mix of both… Thanks! I’ll be sure to show you what I came up with :)

Hugs and many thanks!

Cisne

#156131
Jan 16, 2011 at 4:45am

Google comes back with something like 2,230,000 pictures of the Sun. Is it possible to create a patch that can download all of those, or do you think Google would restrict something like that?

#156132
Jan 16, 2011 at 1:48pm

Hi Eric, the only thing holding you back is disc space, time and max know-how. Google doesn’t get a say in whether or not you can download anything.

#156133
Jan 16, 2011 at 6:22pm

I think you are limited to the first x images that are shown on the first page. For the “bottle” example patch from Luke there are about 72,100,000 results. But when you start the download then it downloads only the first x images that are shown on the first page.

I have only done a quick test, so maybe it is possible to download them all.
I hope you have a nice provider and a good internet connection to download 2,230,000 images of the Sun at once.
I will go for the moon. :)

#156134
Feb 10, 2011 at 2:10am

Luke, your patch is great. Were you able to figure out how to display each image in Max? I’m trying to do something similar

#156135
Feb 10, 2011 at 12:23pm

You could look in to using [jweb] to display them. Or you could route the URLs back in to [jit.uldl] to download the files and then “read” them into [fpic] or even use jitter. There’s an example of how to download the files to your local drive in the “parser/downloader” subpatch in the [jit.str.regexp] help file.
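As a rough illustration of the "route the URLs back in and download the files" step, here is a Python sketch that saves a list of image URLs to a local folder. It is not the "parser/downloader" subpatch. The function names, the numbered file-name scheme and the swappable `fetch` argument are all assumptions made so the sketch can be tried without network access.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlopen


def local_name(url, index):
    """Build a file name from the position in the list plus the URL's last path segment."""
    name = os.path.basename(urlsplit(url).path)
    # The numeric prefix keeps two images with the same name from overwriting each other.
    return f"{index:03d}_{name or 'image'}"


def fetch_bytes(url):
    """Download url and return its body (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def download_all(urls, folder, fetch=fetch_bytes):
    """Save each URL into folder and return the list of written file paths."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for index, url in enumerate(urls):
        path = os.path.join(folder, local_name(url, index))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        paths.append(path)
    return paths
```

The returned paths are what a display step would then read, in the same way the saved files are "read" into [fpic] in Max.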

#156136
Aug 24, 2011 at 5:18am

Hi guys, I’m doing work for the University of La Plata and this post helped me a lot. But I still do not understand much of the programming here.

The first patch only allows me to download pictures of bottles and I could not fix that. Have you been able to fix this problem? It is very important for my work to fix this problem.

Excuse my English, I am Argentinian and my language is Spanish.
I await your responses.

Thank you very much!

Gaston

#156137
Aug 24, 2011 at 11:29pm

What sort of images are you looking for? You can change the search URL and have the patch find other images. Here’s the search string: http://www.google.com/images?hl=en-EN&q=bottle

Change the last word to modify the search query.
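The same substitution can be sketched in Python for anyone generating the search addresses outside Max. The helper name `with_query` is an assumption. The sketch only rewrites the `q` parameter of the address quoted above and says nothing about what the search page returns.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# The search string quoted in the post above.
BASE = "http://www.google.com/images?hl=en-EN&q=bottle"


def with_query(url, term):
    """Return url with its q parameter set to term (added if it was missing)."""
    parts = urlsplit(url)
    # Keep every parameter except the old search term, then append the new one.
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "q"]
    pairs.append(("q", term))
    # urlencode escapes spaces and other characters that are not safe in a URL.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


if __name__ == "__main__":
    print(with_query(BASE, "dog"))
    print(with_query(BASE, "red wine"))
```

Encoding matters for multi-word searches: a term such as "red wine" cannot be pasted into the address with a literal space.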

#156138
Nov 1, 2011 at 1:01pm

Right….i need to do this kind of thing and display images from a google search in a jitter window, so i’ve been messing with the patch above, with some success…although, knowing my programming skills, it’s very ungraceful

what i WANT this to do is get the list of image URLs from Google and then display a random one from that list every time it’s banged….

if anyone could point me in the right direction, or tidy up this patch and make it do that (and explain what you did), then you’ll be on Santa’s nice list and gain massive karma :)

john

– Pasted Max Patch, click to expand. –
#156139
Nov 1, 2011 at 4:23pm

basically….if i could get a list out of the jit.str.regexp of all the URLs that i can then unpack into messages….i would be happy
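The underlying idea, independent of Max, is to collect every matched URL under a numbered index and then pick one index at random on each trigger. A minimal Python sketch of that idea follows. The class `UrlStore` and its method names are hypothetical and do not correspond to any Max object.

```python
import random


class UrlStore:
    """Indexed store of URLs: each added URL gets the next integer index."""

    def __init__(self):
        self.entries = {}

    def add(self, url):
        """Store url under the next free index and return that index."""
        index = len(self.entries)
        self.entries[index] = url
        return index

    def random_url(self, rng=random):
        """Return one stored URL chosen at random; call once per trigger."""
        if not self.entries:
            raise LookupError("no URLs stored yet")
        return self.entries[rng.randrange(len(self.entries))]


if __name__ == "__main__":
    store = UrlStore()
    # Placeholder addresses standing in for the matches from the regular expression.
    for url in ("http://example.com/1.jpg", "http://example.com/2.jpg", "http://example.com/3.jpg"):
        store.add(url)
    print(store.random_url())
```

Keeping the index separate from the URL means the list can be filled once per search and sampled as often as needed.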

#156140
Nov 1, 2011 at 10:10pm

OK….so some more thinking over this and i THINK the solution is to use a coll object

however….the way i’ve patched it up means that the symbols output from the jit.str.tosymbol don’t seem to be going through the pack object into the coll

any ideas why? something i’m missing?

thanks

– Pasted Max Patch, click to expand. –
#156141
Nov 1, 2011 at 10:46pm

here is a method to +1 the index using value.

– Pasted Max Patch, click to expand. –
#156142
Nov 1, 2011 at 10:50pm

tighter patch

– Pasted Max Patch, click to expand. –
#156143
Nov 1, 2011 at 10:53pm

would love to see the correct regexp for harvesting the subject line from Google search result page

#156144
Nov 2, 2011 at 10:47am

Thanks so much, that has been so much help

#156145
Nov 2, 2011 at 11:42am

It’s not the most efficient expression but something like the following should work. I’ve also got a patch that will use the image query from the previous example to grab the referring page (not just the image URL) and then search for the page titles using another [jit.uldl] page request.

– Pasted Max Patch, click to expand. –
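For comparison outside Max, here is a small Python sketch of one way to pull a page title out of downloaded source with a regular expression. It assumes the "page title" means the contents of the HTML `<title>` element. The pattern and the function name `page_title` are illustrative and are not the expression used in the patch above.

```python
import html
import re

# Non-greedy match of everything between <title ...> and </title>, across line breaks.
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(source):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE.search(source)
    if match is None:
        return None
    # Collapse runs of whitespace, then decode entities such as &amp;.
    return html.unescape(" ".join(match.group(1).split()))


if __name__ == "__main__":
    print(page_title("<html><head><title>Cats &amp; dogs</title></head></html>"))
```

A regular expression is enough for a single well-formed element like this one; for anything nested, an HTML parser is usually the safer tool.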
#156146
Nov 4, 2011 at 9:02am

thanks Luke, u regexp jedi

#156147
Dec 2, 2011 at 5:16pm

I would really like to find a way to grab images from a particular tumblr page and use in my interactive. Would jit.str.regexp be the way for me to do this?

#156148
Dec 3, 2011 at 1:01pm

You’d need to download the source with [jit.uldl], store it for processing with [jit.textfile] and then search for the URLs with [jit.str.regexp] and then send this to another [jit.uldl] to download the files to your machine. See how far you get and if you get stuck you can always ask here and post your patch so far.

#156149
Dec 3, 2011 at 1:21pm

Grabbing the source will not help you much, as Tumblr splits the content into pages.

http://boutofcontext.com/tumblr_backup.php
Start with this to get a good source.

Pass the results to a mass picture grabber, such as those found in the Firefox extension gallery,
or use the jpg scanner/downloader found in the jit.uldl help file, or the one Luke provided here a few posts ago.

#156150
Dec 3, 2011 at 11:47pm

Thanks guys really helpful points and a good first base for me to try out what I am thinking. I will see how I get on and may need to post back if I get stuck.
Thanks again

#156151
