I ran a Google Images search, http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg, and the result is thousands of photos. I am looking for a shell script that will download the first n images, for example 1000 or 500. How can I do this? I guess I need some advanced regular expressions or something like that. I have tried many things, but to no avail. Can someone help me, please?

Download first 1000 images from google search

2 Answers

I don't think you can achieve the entire task using regexes alone. There are 3 parts to this problem:

1. Extract the links of all the images. This can't be done with regexes alone; you need a web-based language for it. Google has APIs to do this programmatically. Check out here and here.

2. Assuming you succeeded in the first step with some web-based language, you can use the following regex, which uses lookarounds (a lookbehind and a lookahead) to extract the exact image URL:

(?<=imgurl=).*?(?=&)

The above regex says: grab everything after imgurl= up to the next & symbol. See here for an example, where I took the URL of the first image of your search result and extracted the image URL.

How did I arrive at the above regex? By examining the links of the images found in the image search.

3. Now that you've got the image URLs, use some web-based language/tool to download your images.
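
As a rough illustration of step 2 from the command line, the same regex can be handed to grep. This is only a sketch: it assumes GNU grep built with PCRE support (the -P flag), and the function name extract_imgurls and the sample link are made up for the example.

```bash
#!/bin/bash
# extract_imgurls (hypothetical name): read result links on stdin and print
# the value of every imgurl= parameter, one per line.
# Assumption: GNU grep with PCRE support; -P enables the lookarounds and
# -o prints only the matched text instead of the whole line.
extract_imgurls() {
  grep -oP '(?<=imgurl=).*?(?=&)'
}
```

Assuming the links from step 1 were saved one per line in a file (say links.txt, a hypothetical name), something like `extract_imgurls < links.txt | wget -i -` would cover step 3, since wget reads URLs from standard input when given `-i -`. The extracted values may still be percent-encoded and need unescaping first.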

answered Sep 25 '22 18:09

Pavan Manjunath

update 4: PhantomJS is now obsolete, I made a new script google-images.py in Python using Selenium and Chrome headless. See here for more details: https://stackoverflow.com/a/61982397/218294

update 3: I fixed the script to work with phantomjs 2.x.

update 2: I modified the script to use phantomjs. It's harder to install, but at least it works again. http://sam.nipl.net/b/google-images http://sam.nipl.net/b/google-images.js

update 1: Unfortunately this no longer works. It seems Javascript and other magic is now required to find where the images are located. Here is a version of the script for yahoo image search: http://sam.nipl.net/code/nipl-tools/bin/yimg

original answer: I hacked something together for this. I normally write smaller tools and use them together, but you asked for one shell script, not three dozen. This is deliberately dense code.

http://sam.nipl.net/code/nipl-tools/bin/google-images

It seems to work very well so far. Please let me know if you can improve it, or suggest any better coding techniques (given that it's a shell script).

#!/bin/bash
[ $# = 0 ] && { prog=`basename "$0"`;
echo >&2 "usage: $prog query count parallel safe opts timeout tries agent1 agent2
e.g. : $prog ostrich
       $prog nipl 100 20 on isz:l,itp:clipart 5 10"; exit 2; }
query=$1 count=${2:-20} parallel=${3:-10} safe=$4 opts=$5 timeout=${6:-10} tries=${7:-2}
agent1=${8:-Mozilla/5.0} agent2=${9:-Googlebot-Image/1.0}
query_esc=`perl -e 'use URI::Escape; print uri_escape($ARGV[0]);' "$query"`
dir=`echo "$query_esc" | sed 's/%20/-/g'`; mkdir "$dir" || exit 2; cd "$dir"
url="http://www.google.com/search?tbm=isch&safe=$safe&tbs=$opts&q=$query_esc" procs=0
echo >.URL "$url" ; for A; do echo >>.args "$A"; done
htmlsplit() { tr '\n\r \t' ' ' | sed 's/</\n</g; s/>/>\n/g; s/\n *\n/\n/g; s/^ *\n//; s/ $//;'; }
for start in `seq 0 20 $[$count-1]`; do
wget -U"$agent1" -T"$timeout" --tries="$tries" -O- "$url&start=$start" | htmlsplit
done | perl -ne 'use HTML::Entities; /^<a .*?href="(.*?)"/ and print decode_entities($1), "\n";' | grep '/imgres?' |
perl -ne 'use URI::Escape; ($img, $ref) = map { uri_unescape($_) } /imgurl=(.*?)&imgrefurl=(.*?)&/;
$ext = $img; for ($ext) { s,.*[/.],,; s/[^a-z0-9].*//i; $_ ||= "img"; }
$save = sprintf("%04d.$ext", ++$i); print join("\t", $save, $img, $ref), "\n";' |
tee -a .images.tsv |
{ while IFS=$'\t' read -r save img ref; do
wget -U"$agent2" -T"$timeout" --tries="$tries" --referer="$ref" -O "$save" "$img" || rm "$save" &
procs=$[$procs + 1]; [ $procs = $parallel ] && { wait; procs=0; }
done ; wait; }  # wait inside the pipeline's subshell so the last batch finishes

Features:

under 1500 bytes
explains usage, if run with no args
downloads full images in parallel
safe search option
image size, type, etc. opts string
timeout / retries options
impersonates googlebot to fetch all images
numbers image files
saves metadata
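
To make the "numbers image files" feature more concrete, here is a small bash sketch of how a numbered file name can be derived from an image URL. It mirrors the Perl one-liner in the script using parameter expansion, but it is an illustration rather than the script's own code; the function names image_ext and save_name are invented for the example.

```bash
#!/bin/bash
# image_ext URL (hypothetical helper): guess a file extension from a URL.
image_ext() {
  local ext=$1
  ext=${ext##*[/.]}          # drop everything up to the last "/" or "."
  ext=${ext%%[!a-zA-Z0-9]*}  # cut at the first non-alphanumeric character
  printf '%s\n' "${ext:-img}" # fall back to "img" when nothing is left
}

# save_name INDEX URL (hypothetical helper): build a zero-padded name
# such as 0007.jpg from a counter and the guessed extension.
save_name() {
  printf '%04d.%s\n' "$1" "$(image_ext "$2")"
}
```

For example, `save_name 7 'http://example.com/pics/panda.JPG?size=large'` yields a name with the counter padded to four digits and the query string stripped from the extension.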

I'll post a modular version some time, to show that it can be done quite nicely with a set of shell scripts and simple tools.
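
As a hint of what one such small tool might look like, here is a sketch of the batching step on its own: it runs a command once per input line, a fixed number at a time. This is not the promised modular version, only an assumption about how that piece could be factored out, and run_batched is a made-up name.

```bash
#!/bin/bash
# run_batched N CMD... (hypothetical helper): read one argument per line from
# stdin and run "CMD... ARG" in the background, pausing after every N jobs
# until that batch has finished.
run_batched() {
  local parallel=$1 procs=0 arg
  shift
  while IFS= read -r arg; do
    "$@" "$arg" &
    procs=$((procs + 1))
    if [ "$procs" -eq "$parallel" ]; then wait; procs=0; fi
  done
  wait   # collect the final, possibly partial, batch
}
```

With a list of image URLs on stdin, `run_batched 10 wget -q` would fetch them ten at a time. Output order within a batch is not guaranteed, since the jobs run concurrently.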

answered Sep 26 '22 18:09

Sam Watkins


Tags:

regex

shell

uri

image

Lukap
