
Web scraping Google Images

April 24, 2009 | 12:27 am

Hi everyone.

I want to do a project for school that uses jit.uldl and jit.str.regexp to scrape images from Google Images. I've been using the parser/downloader patch found in the jit.str.regexp help file as an example, but have had little luck. From my understanding, this patch (unedited) is supposed to download the HTML source from the c74 homepage, send the info as a matrix to jit.str.regexp, which parses the HTML, extracts the gif and jpg references, reassembles them into web addresses, and then sends those back to jit.uldl to be downloaded. Is that right? Anyway, it doesn't appear to be doing that, or I don't know where the images are being downloaded to. As a note, I sometimes get the error jit.str.regexp: PCRE error -10 in the Max window.

Could anybody help me understand jit.str.regexp better, and what exactly is going on in the jit.str.regexp parser/downloader example? Does anybody have any good examples of web scraping that could help?

thank you so much!


April 24, 2009 | 12:42 am

NOTE: I am getting the error jit.str.regexp: PCRE error -10 EVERY time I try to run the parser/downloader patch. Does anyone know what this means? It might solve my problem.



Ad.
April 27, 2009 | 10:30 pm

I had this problem too:
Here's the solution: you have to use a trigger object, t b wclose open, connected to jit.textfile. Apparently, if you don't open the jit.textfile window, it cannot work.

Ad.


April 28, 2009 | 1:38 am

Works perfectly! Thank you!


May 3, 2009 | 9:17 am

This queries Google Images and downloads all the thumbnails. Please let me know if you do anything interesting; I'd like to check it out.


July 22, 2010 | 4:15 am

Hi,

I am new to Max; my professor suggested I use it for a project I am working on. I want to enable a user to input a word, for example "dog", and have Max search for that word in Google Image Search and return an image.


July 22, 2010 | 11:20 am

That is definitely possible in Max. As it is school work I won't tell you how to do it, but work through the Max tutorials to get to grips with the basics. Then take a look at the example patch in the [jit.str.regexp] help patch; it is called [parser/downloader]. The main objects are [jit.uldl] for downloading the page source code and [jit.textfile] for storing it, then [jit.str.regexp] for searching through the code to find the particular URL you want. Regular expressions can be a bit of a pain, but there is lots of information on the forum already. One tip to get you started: the URL for a Google image search of "dog" looks something like this: "http://www.google.com/images?q=dog".
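For what it's worth, the same flow written out in Python looks roughly like this. It is just a sketch of the idea, not the Max objects themselves; the regex, the folder name and the browser-style user agent are my own assumptions, and Google may serve different markup to a script than to a browser:

import os
import re
import urllib.request

search_url = "http://www.google.com/images?q=dog"   # same URL shape as above

# 1. Download the page source (the job [jit.uldl] does).
req = urllib.request.Request(search_url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")

# 2. Search the source for image URLs (the job [jit.str.regexp] does).
image_urls = re.findall(r'https?://[^"\'\s]+?\.(?:jpg|jpeg|gif|png)', html)

# 3. Download each match to disk (the second [jit.uldl]).
os.makedirs("thumbs", exist_ok=True)
for i, url in enumerate(image_urls):
    try:
        urllib.request.urlretrieve(url, os.path.join("thumbs", "img_%03d.jpg" % i))
    except OSError:
        pass  # skip anything that fails to download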

lh


August 1, 2010 | 1:58 pm

Possible with Max. But Google has again changed the way it structures the URL/path, so I have to change my patch again. Maybe this can help:

Look at how the URL/path is structured.
For example:

http://images.google.be/images?hl=nl&gbv=2&biw=1436&bih=751&tbs=isch%3A1&sa=1&q=cat&btnG=Zoeken&aq=0&aqi=g10&aql=&oq=&gs_rfai=

&q=cat
so it will search for the word cat
… and so on
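If it helps, here is the same idea in a few lines of Python, just to show that q= is the parameter carrying the search word. Which of the other parameters Google actually requires is an assumption; most of them can usually be dropped:

import urllib.parse

def image_search_url(query, host="http://images.google.be/images"):
    # q= carries the search word; everything else is optional decoration
    params = {"hl": "nl", "q": query}
    return host + "?" + urllib.parse.urlencode(params)

print(image_search_url("cat"))   # http://images.google.be/images?hl=nl&q=cat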


October 18, 2010 | 10:00 am

Hello everyone,

By tweaking around with [jit.str.regexp] I was able to download all the thumbnails from Google for a certain word query. Now, how can I get the freshly downloaded images back into Max and show them?

Thank you so much,

Cisne


October 19, 2010 | 10:28 am

Hey everyone, how have you been?

I was able to put this to work the way I wanted, but now I can't seem to get rid of the PCRE error -10 message, which won't allow me to download images. I tried the [trigger] solution mentioned above, but even though the [jit.textfile] window opens, I still get the error.

Can somebody help, please? Thanks!

Cisne


November 13, 2010 | 11:51 am

Hello there friends,

After some serious dabbling around, I am still faced with the PCRE error -10 problem.

I made up a patch that tries to replicate the problem. Please note that the trigger solution mentioned earlier in this thread is present, but no bangs are connected to it as of yet, so you can try both ways and, hopefully, help me get this done… :)
I mean, it is really weird that everything works fine with some "queries" sent to Google by this patch, downloading all the files needed, while with other "tags" it does not even try…

Honest thanks, coming from honest work :)

Cisne


November 13, 2010 | 1:56 pm

I didn't experience the error when using your patch, but it didn't seem to be finding the images either, probably due to the change in the way images are displayed in the search results. It seems Google now uses JavaScript to load the images and canvas to display them. Here's a tiny patch which will print the URLs of the first 20 images for a given search term. It's up to you to patch that back into [jit.uldl] to download them to your computer.
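In rough Python terms the idea is something like the sketch below. At the time each result linked through /imgres?imgurl=<full-size URL>&…, so pulling out that parameter was one way to get the full-size addresses; that link format, the regex and the user agent are all assumptions about the markup of the day:

import re
from urllib.parse import unquote
from urllib.request import Request, urlopen

def first_image_urls(term, count=20):
    req = Request("http://www.google.com/images?q=" + term,
                  headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode("utf-8", errors="replace")
    # each result links through /imgres?imgurl=<encoded full-size URL>&...
    matches = re.findall(r'imgres\?imgurl=([^&"]+)', html)
    return [unquote(m) for m in matches[:count]]

for u in first_image_urls("bottle"):
    print(u)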


November 13, 2010 | 8:56 pm

Dear Luke, thank you so much! But I think your post is missing the patch :) I would be really grateful to check out the new way of doing this you mentioned. Thanks, honestly!


November 16, 2010 | 9:40 pm

Here’s my patch again just in case it helps you out still:

– Pasted Max patch –

November 20, 2010 | 12:57 am

Thank you, dear Luke! As I had the chance to tell you, I got it working on my own, which was a boost to my confidence, but, haha, opening your patch made it look so easy that I ended up using a mix of both… Thanks! I'll be sure to show you what I came up with :)

Hugs and many thanks!

Cisne


January 16, 2011 | 4:45 am

Google comes back with results like 2,230,000 pictures of the Sun. Is it possible to create a patch that can download all of those? Or do you think Google would restrict something like that?


January 16, 2011 | 1:48 pm

Hi Eric, the only thing holding you back is disc space, time and Max know-how. Google doesn't get a say in whether or not you can download anything.


January 16, 2011 | 6:22 pm

I think you are limited to the first x images that are shown on the first page. For the "bottle" example patch from Luke there are about 72,100,000 results, but when you start the download it only downloads the first x images shown on the first page.

I have only done a quick test, so maybe it is possible to download them all.
I hope you have a nice provider and a good internet connection to download 2,230,000 images of the Sun at once.
I will go for the moon. :)
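For anyone who does want to go past the first page: the old web interface paged through results with a start offset, so in Python the per-page URLs could be built like the sketch below. The parameter name, the page size of 20 and whether Google tolerates this at all are assumptions:

from urllib.parse import quote

def paged_search_urls(term, pages, per_page=20):
    base = "http://www.google.com/images?q=" + quote(term)
    return [base + "&start=" + str(n * per_page) for n in range(pages)]

for url in paged_search_urls("sun", pages=5):
    print(url)   # feed each of these to the downloader in turn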


February 10, 2011 | 2:10 am

Luke, your patch is great. Were you able to figure out how to display each image in Max? I'm trying to do something similar.


February 10, 2011 | 12:23 pm

You could look into using [jweb] to display them. Or you could route the URLs back into [jit.uldl] to download the files and then "read" them into [fpic], or even use Jitter. There's an example of how to download the files to your local drive in the "parser/downloader" subpatch in the [jit.str.regexp] help file.


August 24, 2011 | 5:18 am

Hi guys, I'm doing work for the University of La Plata and this post helped me a lot, but I still don't understand much of the programming here.

The first patch only allows me to download pictures of bottles, and I could not change that. Have you been able to fix this problem? It is very important for my work.

Excuse my English, I am Argentinian and my language is Spanish.
I await your responses.

Thank you very much!

Gaston


August 24, 2011 | 11:29 pm

What sort of images are you looking for? You can change the search URL and have the patch find other images. Here’s the search string: http://www.google.com/images?hl=en-EN&q=bottle

Change the last word to modify the search query.


November 1, 2011 | 1:01 pm

Right… I need to do this kind of thing and display images from a Google search in a Jitter window, so I've been messing with the patch above, with some success… although, knowing my programming skills, it's very ungraceful.

What I WANT this to do is get the list of image URLs from Google and then display a random one from that list every time it's banged…

If anyone could point me in the right direction, or tidy up this patch and make it do that (and explain what you did), then you'll be on Santa's nice list and gain massive karma :)

john

– Pasted Max patch –

November 1, 2011 | 4:23 pm

Basically… if I could get a list out of jit.str.regexp of all the URLs, which I can then unpack into messages… I would be happy.
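In plain Python terms, the logic I'm after is basically this (just to show the idea; the names are made up, and a [coll] would play the role of the list):

import random

image_urls = []                      # filled from the regexp matches

def add_url(url):                    # what storing into [coll] would do
    image_urls.append(url)

def bang():                          # what each bang should do
    return random.choice(image_urls) if image_urls else None

add_url("http://example.com/a.jpg")
add_url("http://example.com/b.jpg")
print(bang())                        # prints one of the stored URLs at random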


November 1, 2011 | 10:10 pm

OK… so after some more thinking about this, I THINK the solution is to use a coll object.

However… the way I've patched it up, the symbols output from jit.str.tosymbol don't seem to be going through the pack object into the coll.

Any ideas why? Something I'm missing?

Thanks

– Pasted Max patch –

November 1, 2011 | 10:46 pm

Here is a method to +1 the index using [value].

– Pasted Max patch –

November 1, 2011 | 10:50 pm

tighter patch

– Pasted Max patch –

November 1, 2011 | 10:53 pm

Would love to see the correct regexp for harvesting the subject line from a Google search results page.


November 2, 2011 | 10:47 am

Thanks so much, that has been so much help


November 2, 2011 | 11:42 am

It’s not the most efficient expression but something like the following should work. I’ve also got a patch that will use the image query from the previous example to grab the referring page (not just the image URL) and then search for the page titles using another [jit.uldl] page request.

– Pasted Max patch –
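For reference, one way the title-grabbing part might be written in Python (this is just an illustration of the kind of expression, not necessarily the one used in the patch above):

import re

html = "<html><head><title>bottle - Google Search</title></head>...</html>"
match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
if match:
    print(match.group(1).strip())    # prints: bottle - Google Search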

November 4, 2011 | 9:02 am

Thanks Luke, you regexp jedi.


December 2, 2011 | 5:16 pm

I would really like to find a way to grab images from a particular tumblr page and use them in my interactive. Would jit.str.regexp be the way for me to do this?


December 3, 2011 | 1:01 pm

You'd need to download the source with [jit.uldl], store it for processing with [jit.textfile], search for the URLs with [jit.str.regexp], and then send those to another [jit.uldl] to download the files to your machine. See how far you get, and if you get stuck you can always ask here and post your patch so far.


December 3, 2011 | 1:21 pm

Grabbing the source will not help you much, as tumblr splits the content into pages.

http://boutofcontext.com/tumblr_backup.php
Start with this for getting a good source.

Pass the results to a mass picture grabber, such as one from the Firefox extension gallery, or use the jpg scanner/downloader found in the jit.uldl help file, or the one Luke provided here a few patches ago.
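In rough Python terms, walking a tumblr blog page by page might look like the sketch below. The /page/N path is how tumblr blogs usually paginate; the blog address, the regex and the 50-page cap are placeholders/assumptions:

import re
from urllib.request import Request, urlopen

def image_urls_on(url):
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode("utf-8", errors="replace")
    return re.findall(r'https?://[^"\s]+?\.(?:jpg|jpeg|gif|png)', html)

blog = "http://example.tumblr.com"       # placeholder blog address
all_urls = []
for page in range(1, 51):                # cap at 50 pages, just in case
    found = image_urls_on("{}/page/{}".format(blog, page))
    if not found:
        break                            # an empty page means we're done
    all_urls.extend(u for u in found if u not in all_urls)
print(len(all_urls), "image URLs collected")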


December 3, 2011 | 11:47 pm

Thanks guys, really helpful points and a good base for me to try out what I am thinking. I will see how I get on and may post back if I get stuck.
Thanks again.


April 30, 2014 | 8:04 pm

I was just checking the jit.str.regexp help file example for downloading and parsing web pages. It is not working as it is (the page downloads but not the images). I checked out the code posted in this thread and spent some time dissecting it, but was not able to get it working. Does anyone have any suggestions or updates to this thread?


May 1, 2014 | 12:09 pm

My patches were working with Google, but not anymore.
This patch still works with the Bing search engine:

– Pasted Max patch –

Something else: Copy Compressed in Max 6 gives -begin_max5_patcher- in the pasted code?


July 20, 2014 | 4:09 pm

Hey BITter, nice patch!

Do you have any idea how to make it so that I can download all the pictures?

Right now it stops at 34. With my Google picture scraper (which also stopped working, PCRE error -10) I could go from page to page to scrape them all.

With the one-window view in Bing I can't find a workaround.


August 3, 2014 | 6:26 am

Thanks.
Two weeks without internet and a few days between 1600 and 2000 meters above sea level. :)
I was never in need of downloading all the pictures.
Don't you think that downloading all the images would be too much?

Later on, I will see what I can do.
Have you found a solution?

