I'm trying to learn Python scraping and came across a program that scrapes a set number of images from Google image search results.
I changed it to fetch 5 images. It worked for a while, but it recently stopped working and now reports that there are 0 images.
import os
import json
import urllib2
from bs4 import BeautifulSoup  # this import was missing

def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')

query = raw_input("query image")  # you can change the query for the image here
image_type = "ActiOn"
query = '+'.join(query.split())
url = "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your images here (raw string so the backslashes aren't escapes)
DIR = r"C:\Users\mynam\Desktop\WB"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

soup = get_soup(url, header)

ActualImages = []  # contains (link, type) pairs for the large original images
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)
    ActualImages.append((meta["ou"], meta["ity"]))

print "there are total", len(ActualImages), "images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split('+')[0])  # split on '+', since the query was joined with '+'
if not os.path.exists(DIR):
    os.mkdir(DIR)

# download the first 5 images
for i, (img, Type) in enumerate(ActualImages[0:5]):
    try:
        req = urllib2.Request(img, headers=header)  # pass the header dict directly, not nested in another dict
        raw_img = urllib2.urlopen(req).read()
        cntr = len([f for f in os.listdir(DIR) if image_type in f]) + 1
        print cntr
        ext = Type if len(Type) else "jpg"  # fall back to .jpg when no type is given
        f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + ext), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e
There are no error logs; the directory gets created but stays empty. The ActualImages list remains empty for some reason.
It seems that Google has recently removed the metadata from the image search results, i.e. you won't find rg_meta
in the HTML anymore. Therefore, soup.find_all("div", {"class": "rg_meta"})
will not return anything.
I haven't found a workaround for this. I believe Google made this change precisely to prevent scraping.
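You can confirm this yourself: fetch the results page and count the rg_meta divs. The helper below is a hypothetical sketch (it just needs beautifulsoup4); run it against a freshly downloaded Google Images page and the count comes back 0.

```python
# Count how many <div class="rg_meta"> blocks an HTML page contains.
# Hypothetical diagnostic helper, not part of the original script;
# requires beautifulsoup4.
from bs4 import BeautifulSoup

def count_rg_meta(html):
    """Return the number of <div class="rg_meta"> elements in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", {"class": "rg_meta"}))
```

On HTML saved from a current Google Images search, count_rg_meta returns 0, which is exactly why the question's loop never appends anything.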
I haven't seen anyone mention this. It's not an ideal solution, but if you want something simple that works and doesn't take much hassle to set up, you can use Selenium. Since Google seems to be intentionally preventing image scraping, as Densus mentioned, this may be an inappropriate use of Selenium; I'm not sure.
There are plenty of public, working Selenium-based Google image scrapers on GitHub that you can view and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most if not all of them will be Selenium implementations.
For example: https://github.com/Msalmannasir/Google_image_scraper
For this one, just download the Chromium driver and update the file path to it in the code.
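To give a rough idea of the Selenium approach, here is a minimal Python 3 sketch: drive a real browser, let Google's JavaScript render the results, then read image URLs from the DOM. It assumes selenium is installed and a Chrome/Chromium driver is on your PATH; the bare "img" selector is a simplification, and Google's markup changes often, so treat the selectors as guesses.

```python
def build_search_url(query):
    # Same URL scheme the original script used.
    return ("https://www.google.com/search?q="
            + "+".join(query.split())
            + "&source=lnms&tbm=isch")

def scrape_image_urls(query, limit=5):
    # Imported lazily so build_search_url works even without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # needs a matching chromedriver on PATH
    try:
        driver.get(build_search_url(query))
        # Grab rendered <img> elements and keep non-empty src attributes.
        thumbs = driver.find_elements(By.CSS_SELECTOR, "img")
        urls = [t.get_attribute("src") for t in thumbs]
        return [u for u in urls if u][:limit]
    finally:
        driver.quit()

if __name__ == "__main__":
    for url in scrape_image_urls("kittens"):
        print(url)
```

Note that this returns thumbnail URLs (often data: URIs), not the full-size originals the old rg_meta JSON exposed; getting the originals typically requires clicking each thumbnail, which the GitHub scrapers linked above handle.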