Web scraping Google Images
Hi everyone.
I want to do a project for school that uses jit.uldl and jit.str.regexp to scrape images from Google Images. I've been using the parser/downloader patch found in the jit.str.regexp help file as an example, but have had little luck. From my understanding, this patch (unedited) is supposed to download the HTML source from the c74 homepage, send it as a matrix to jit.str.regexp, which parses the HTML and extracts the gif and jpg file references, rebuilds them into full web addresses, and then sends those back to jit.uldl to be downloaded. Is that right? Anyway, it doesn't appear to be doing that, or I don't know where the images are being downloaded to. As a note, I sometimes get the error jit.str.regexp: PCRE error -10 in the Max window.
Could anybody help me understand jit.str.regexp better, and what exactly is going on in the jit.str.regexp parser/downloader example? Does anybody have any good examples of web scraping that could help?
thank you so much!
NOTE: I am getting the error jit.str.regexp: PCRE error -10 EVERY time I try to run the parser/downloader patch. does anyone know what this means? It might solve my problem.
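For readers who want to see the described steps outside of Max, here is a minimal sketch in Python of the same idea: find gif/jpg references in a page's HTML and rebuild them into full web addresses. It is not the help patch's actual logic. The pattern, the function name extract_image_urls and the example.com addresses are all assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: any src/href attribute whose value ends in .gif or .jpg/.jpeg.
IMAGE_REF = re.compile(
    r'(?:src|href)\s*=\s*["\']([^"\']+\.(?:gif|jpe?g))["\']',
    re.IGNORECASE,
)


def extract_image_urls(html, base_url):
    """Return absolute URLs for every gif/jpg reference found in html."""
    urls = []
    for ref in IMAGE_REF.findall(html):
        absolute = urljoin(base_url, ref)  # rebuild a full web address
        if absolute not in urls:           # keep only the first occurrence
            urls.append(absolute)
    return urls


# Made-up page source standing in for what a downloader would return.
sample_html = (
    '<img src="/images/logo.gif"> '
    '<a href="pics/photo.jpg">photo</a> '
    '<a href="/about.html">about</a>'
)
image_urls = extract_image_urls(sample_html, "http://example.com/home/")
print(image_urls)
```

The list that comes out is what would then be handed to a downloader, one address at a time.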
I had this problem too:
Here's the solution: you have to use a trigger object, [t b wclose open], connected to [jit.textfile]. Apparently, if you don't open the [jit.textfile] window, it doesn't work.
Ad.
Works perfectly! Thank you!
this queries google images and downloads all the thumbnails. please let me know if you do anything interesting, i'd like to check it out.
Hi,
I am new to Max; my professor suggested I use it for a project I am working on. I want to enable a user to input a word, for example "dog", and have Max search for that word in Google Image Search and return an image.
That is definitely possible in Max. As it is school work I won't tell you how to do it, but work through the Max tutorials to get to grips with the basics. Then take a look at the example patch in the [jit.str.regexp] help patch. It is called [parser/downloader]. The main objects are [jit.uldl] for downloading the page source code and [jit.textfile] for storing it, then [jit.str.regexp] for iterating through the code to find the particular URL you want. Regular expressions can be a bit of a pain, but there is lots of information on the forum already. One tip to get you started: the URL for a Google image search of "dog" looks something like this: "http://www.google.com/images?q=dog".
lh
It's possible with Max. But Google has again changed the way it structures the URL/path, so I have to change my patch again. Maybe this can help:
Look at how the URL/path is structured.
Like for example:
&q=cat
so it will search for the word cat
... and so on
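To spell out what "structured" means here, a small Python sketch (not a Max patch) that splits a search URL into its parts. The URL is the one quoted elsewhere in this thread with the term changed to cat; describe_search_url is a made-up helper name, and Google's current URLs may look different.

```python
from urllib.parse import urlsplit, parse_qs


def describe_search_url(url):
    """Split a URL into host, path and a dict of its query parameters."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        "path": parts.path,
        "params": parse_qs(parts.query),  # each value is a list of strings
    }


info = describe_search_url("http://www.google.com/images?hl=en-EN&q=cat")
print(info["host"], info["path"])
print(info["params"]["q"])  # the search term lives in the q parameter
```

Everything after the ? is a list of name=value pairs joined by &, which is why changing only the q=... part changes the search word.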
Hello everyone,
By tweaking [jit.str.regexp] I was able to download all the thumbnails from Google for a given search word. Now, how can I load the freshly downloaded images back into Max and show them?
Thank you so much,
Cisne
Hey everyone, how have you been?
I was able to put this to work the way I wanted, but now I can't seem to get rid of the PCRE error-10 message, which won't allow me to download images. I tried the [trigger] solution mentioned, but, even though the [jit.textfile] opens, it still says that there was this error.
Can somebody help, please? Thanks!
Cisne
Hello there friends,
After some serious dabbling around, I am still faced with the PCRE-10 problem.
I made up a patch that tries to replicate the problem. Please note that the trigger solution mentioned earlier in this thread is present, but no bangs are connected to it as of yet, so you can try both ways and, hopefully, help me get this done... :)
I mean, it is really weird that everything works fine with some "queries" sent to Google by this patch, downloading all the files needed, while with other "tags" it doesn't even try...
Honest thanks, coming from honest work :)
Cisne
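A possible lead on the error number: in the PCRE library's own headers, return code -10 is PCRE_ERROR_BADUTF8, meaning the text handed to the matcher was not valid UTF-8. Assuming [jit.str.regexp] reports PCRE's code unchanged (an assumption, nothing in this thread confirms it), that would fit an error that appears for some search terms and not others, since only some result pages would contain offending bytes. The Python sketch below only illustrates the idea of checking and cleaning downloaded bytes; the helper names are made up and it is not a fix for the patch.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def make_utf8_safe(data):
    """Replace invalid byte sequences so the result is always valid UTF-8."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Two made-up "pages": the same word stored as UTF-8 and as Latin-1.
clean_page = "caf\u00e9".encode("utf-8")
latin1_page = "caf\u00e9".encode("latin-1")  # the last byte is not valid UTF-8

print(first_invalid_utf8_offset(clean_page))
print(first_invalid_utf8_offset(latin1_page))
print(first_invalid_utf8_offset(make_utf8_safe(latin1_page)))
```

If this is indeed the cause, a page that triggers the error should show stray non-UTF-8 bytes somewhere in its source.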
I didn't experience the error when using your patch but it didn't seem to be finding the images either, probably due to the change in the way images are displayed in the search results. It seems it uses javascript to reload images and canvas to display them. Here's a tiny patch which will print the URLs to the first 20 images for a given search term. It's up to you to patch that back in to [jit.uldl] to download them to your computer.
Dear Luke, thank you so much! But I think your post missed the patch :) I would be really grateful to check out that new way of doing this you mentioned! Thanks, honestly. Thanks!
Here's my patch again just in case it helps you out still:
Thank you dear Luke! As I had the chance to tell you, I got it working on my own, which was a boost in my confidence, but, ahah, opening your patch made it look so easy, that I ended up using a mix of both... Thanks! I'll be sure to show you what I came up with :)
Hugs and many thanks!
Cisne
Google comes back with something like 2,230,000 pictures of the Sun. Is it possible to create a patch that can download all of those? Or do you think Google would restrict something like that?
Hi Eric, the only thing holding you back is disc space, time and max know-how. Google doesn't get a say in whether or not you can download anything.
I think you are limited to the first x images that are shown on the first page. For the "bottle" example patch from Luke there are about 72,100,000 results. But when you start the download then it downloads only the first x images that are shown on the first page.
I have only done a quick test, so it may still be possible to download them all.
I hope you have a nice provider and a good internet connection to download 2,230,000 images of the Sun at once.
I will go for the moon. :)
Luke, your patch is great. Were you able to figure out how to display each image in Max? I'm trying to do something similar
You could look into using [jweb] to display them. Or you could route the URLs back into [jit.uldl] to download the files and then "read" them into [fpic], or even use Jitter. There's an example of how to download the files to your local drive in the "parser/downloader" subpatch in the [jit.str.regexp] help file.
Hi guys, I'm doing work for the University of La Plata and this post helped me a lot. But I still don't understand much of the programming here.
The first patch only allows me to download pictures of bottles and I could not fix that. Have you been able to fix this problem? It is very important for my work to fix this problem.
Excuse my English, I am Argentinian and my language is Spanish.
I await your responses.
Thank you very much!
Gaston
What sort of images are you looking for? You can change the search URL and have the patch find other images. Here's the search string: http://www.google.com/images?hl=en-EN&q=bottle
Change the last word to modify the search query.
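As a rough illustration of "change the last word", here is a Python sketch that builds the same search string for any term. The base address and the hl parameter are copied from the post above; build_search_url is a hypothetical helper, and whether Google still answers this URL the same way is not something the sketch can promise.

```python
from urllib.parse import urlencode

# Base address taken from the search string quoted above.
SEARCH_BASE = "http://www.google.com/images"


def build_search_url(term, language="en-EN"):
    """Return the search URL for term, escaping spaces and symbols."""
    query = urlencode({"hl": language, "q": term})
    return SEARCH_BASE + "?" + query


print(build_search_url("bottle"))
print(build_search_url("red bottle"))  # the space is escaped as +
```

The escaping matters once a query has more than one word, because a raw space in the URL can cut the request short.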
Right....i need to do this kind of thing and display images from a google search in a jitter window, so i've been messing with the patch above, with some success...although, knowing my programming skills, it's very ungraceful
what i WANT this to do is get the list of image URLs from Google and then display a random one from that list every time it's banged....
if anyone could point me in the right direction, or tidy up this patch and make it do that (and explain what you did) then you'll be on santa's nice list and gain massive karma :)
john
basically....if i could get a list out of the jit.str.regexp of all the URLs that i can then unpack into messages....i would be happy
OK....so some more thinking over this and i THINK the solution is to use a coll object
however....the way i've patched it up means that the symbols output from the jit.str.tosymbol don't seem to be going through the pack object into the coll
any ideas why? something i'm missing?
thanks
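To make the goal concrete (an indexed list of URLs, and a random one out on every bang), here is a sketch of the data flow in Python. It is only loosely analogous to a [coll] fed by a counter, it says nothing about why the [pack] connection fails, and UrlStore plus the example.com addresses are made up for illustration.

```python
import random


class UrlStore:
    """Integer-indexed store, loosely analogous to a [coll] filled from a counter."""

    def __init__(self):
        self.entries = {}
        self.next_index = 1

    def add(self, url):
        """Store url under the next free index, then advance the index."""
        self.entries[self.next_index] = url
        self.next_index += 1

    def bang(self, rng=random):
        """Return one stored URL chosen at random, or None if the store is empty."""
        if not self.entries:
            return None
        index = rng.randint(1, len(self.entries))  # inclusive on both ends
        return self.entries[index]


store = UrlStore()
for url in ["http://example.com/a.jpg", "http://example.com/b.jpg", "http://example.com/c.jpg"]:
    store.add(url)

picked = store.bang()
print(picked)
```

The point of the index is that each URL arrives as a pair (index, URL), so both values have to be present before anything is stored.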
here is a method to +1 the index using value.
tighter patch
would love to see the correct regexp for harvesting the subject line from Google search result page
Thanks so much, that has been so much help
It's not the most efficient expression but something like the following should work. I've also got a patch that will use the image query from the previous example to grab the referring page (not just the image URL) and then search for the page titles using another [jit.uldl] page request.
thanks Luke, u regexp jedi
I would really like to find a way to grab images from a particular tumblr page and use in my interactive. Would jit.str.regexp be the way for me to do this?
You'd need to download the source with [jit.uldl], store it for processing with [jit.textfile] and then search for the URLs with [jit.str.regexp] and then send this to another [jit.uldl] to download the files to your machine. See how far you get and if you get stuck you can always ask here and post your patch so far.
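The chain described above (download the source, search it for URLs, download the files) can be sketched in Python like this. It is an illustration under assumptions, not a translation of any patch: the pattern only catches absolute http(s) addresses ending in jpg, jpeg, gif or png, scrape_images is a made-up name, and two images with the same file name would overwrite each other.

```python
import os
import re
from urllib.parse import urlsplit
from urllib.request import urlopen

# Assumed pattern: absolute http(s) addresses that end in a common image extension.
IMAGE_URL = re.compile(r'https?://[^\s"\'<>]+\.(?:jpe?g|gif|png)', re.IGNORECASE)


def fetch(url):
    """Download a URL and return its raw bytes (standard library only)."""
    with urlopen(url) as response:
        return response.read()


def scrape_images(page_url, out_dir, fetch=fetch):
    """Download page_url, find image URLs in it, and save each image to out_dir."""
    source = fetch(page_url).decode("utf-8", errors="replace")  # 1. get the source
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in dict.fromkeys(IMAGE_URL.findall(source)):        # 2. find URLs, no repeats
        name = os.path.basename(urlsplit(url).path)
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:                        # 3. download each file
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

A call would look like scrape_images("http://example.com/blog", "downloads"), where the address is a placeholder. The fetch argument can be swapped for a stand-in function to try the logic without touching the network.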
Grabbing the source will not help you much, as Tumblr splits the content into pages.
http://boutofcontext.com/tumblr_backup.php
Start with this to get a good source.
Pass the results to a mass picture grabber, such as those found in the Firefox extension gallery.
Or use the jpg scanner/downloader found in the jit.uldl help file, or one Luke provided here a few patches ago
Thanks guys really helpful points and a good first base for me to try out what I am thinking. I will see how I get on and may need to post back if I get stuck.
Thanks again
I was just checking the jit.str.regexp help file example for downloading and parsing web pages. It is not working as it is. (page downloading but not images) I checked out the code posted in this thread and spent some time dissecting it but was not able to get it working. Anyone have any suggestions or updates to this thread?
My patches were working with Google, but not anymore.
This patch still works with the Bing search engine:
Something else: Copy Compressed in Max6 gives -begin_max5_patcher- in the pasted code?
hey bitter, nice patch!
do you have any idea how to make it download all the pictures?
right now it stops at 34. with my google picture scraper (which has also stopped working, PCRE error -10) i could go from page to page to scrape them all.
with the single-window view in bing i can't find a workaround.
Thanks
Two weeks without internet and a few days between 1600 and 2000 meters above sea level. :)
I was never in the need to download all pictures.
Don't you think that downloading all images will be too much?
Later on, i will see what i can do.
Have you found a solution?
How did you work out "imgurl=([^&]+)&" from the text that [jit.uldl] downloads? I would think you need to read through all of the text before writing a regular expression, to know what will match and what won't. But the text is too long.
I don't understand what you're asking.
The output text is long, but it is possible to read.
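On how a pattern like "imgurl=([^&]+)&" works: it looks for the literal text imgurl=, then captures every character that is not an & up to the next &. You only need to spot one occurrence in the source to write it; the matcher then finds the rest. Here is a small Python sketch on made-up text (the sample markup and example.com addresses are invented, not real Google output).

```python
import re
from urllib.parse import unquote

# The expression quoted above: capture everything between "imgurl=" and the next "&".
IMGURL = re.compile(r"imgurl=([^&]+)&")

# Invented stand-in for a long results page.
sample = (
    '<a href="/imgres?imgurl=http://example.com/pics/bottle1.jpg&h=300&w=400">'
    '<a href="/imgres?imgurl=http://example.com/pics/bottle%202.jpg&h=200&w=100">'
)

# findall returns the captured group for every match; unquote undoes %20-style escapes.
matches = [unquote(m) for m in IMGURL.findall(sample)]
print(matches)
```

Searching the downloaded text for a fragment of one known image address is usually enough to see which parameter name surrounds it.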
The patch doesn't work anymore. It doesn't retrieve the list. Any ideas?
Thanks
FC