Scraping Google images with Python

I'm trying to learn web scraping with Python and came across a program that scrapes a set number of images from Google image search results.

I changed it to fetch 5 images. It worked for a while, but recently it stopped working and now prints output such as "there are total 0 images".

import requests
import re
import urllib2
import os
import cookielib
import json
from bs4 import BeautifulSoup  # needed by get_soup below

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')


query = raw_input("query image")# you can change the query for the image  here
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.com/search?q="+query+"&source=lnms&tbm=isch"
print url
#add the directory for your image here
DIR=r"C:\Users\mynam\Desktop\WB"  # raw string so the backslashes are never treated as escapes
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
}
soup = get_soup(url,header)


ActualImages=[]# contains the link for Large original images, type of  image
for a in soup.find_all("div",{"class":"rg_meta"}):
    link , Type =json.loads(a.text)["ou"]  ,json.loads(a.text)["ity"]
    ActualImages.append((link,Type))

print  "there are total" , len(ActualImages),"images"

if not os.path.exists(DIR):
            os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])

if not os.path.exists(DIR):
            os.mkdir(DIR)
###print images
for i , (img , Type) in enumerate(ActualImages[0:5]):
    try:
        # header is already a dict containing 'User-Agent', so pass it directly
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()

        # use a separate name so the comprehension does not overwrite the loop variable i
        cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1
        print cntr
        if len(Type)==0:
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+".jpg"), 'wb')
        else :
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+"."+Type), 'wb')


        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : "+img
        print e

No errors are logged; the file gets created, but it is empty. The ActualImages list remains empty for some reason.

shawnin damnen asked Feb 06 '20 13:02



2 Answers

It seems that Google has recently removed the metadata from the image search results, i.e. you won't find rg_meta in the HTML. Therefore, soup.find_all("div",{"class":"rg_meta"}) returns an empty list.

I haven't found a solution for this. I believe Google made this change specifically to prevent scraping.
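One way to confirm this diagnosis without hitting Google at all is to run the same extraction on two small HTML snippets, one with an rg_meta div and one without. The sketch below is written for Python 3 using only the standard library (not the BeautifulSoup call from the question), and both sample snippets are made-up stand-ins rather than real Google markup.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect the JSON payloads found inside <div class="rg_meta"> elements."""

    def __init__(self):
        super().__init__()
        self._in_meta = False
        self._buf = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "rg_meta" in classes:
            self._in_meta = True
            self._buf = []

    def handle_data(self, data):
        if self._in_meta:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_meta:
            self.records.append(json.loads("".join(self._buf)))
            self._in_meta = False


def extract_images(html):
    """Return (url, type) pairs, mirroring the ActualImages list in the question."""
    parser = RgMetaParser()
    parser.feed(html)
    parser.close()
    return [(record["ou"], record["ity"]) for record in parser.records]


# Hypothetical markup shaped like what the question's code expects.
OLD_STYLE = '<div class="rg_meta">{"ou": "https://example.com/a.jpg", "ity": "jpg"}</div>'
# Hypothetical markup with no rg_meta element at all.
NEW_STYLE = '<div class="other"><img src="data:image/gif;base64,AAAA"></div>'

print(extract_images(OLD_STYLE))  # [('https://example.com/a.jpg', 'jpg')]
print(extract_images(NEW_STYLE))  # []
```

If the page you download behaves like the second snippet, the loop in the question never runs, which matches the "0 images" output with no exception raised.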

Densus answered Oct 20 '22 00:10


I haven't seen anyone mention this. It's not an ideal solution, but if you want something simple that works and takes no hassle to set up, you can use Selenium. Since Google seems to be intentionally preventing image scraping, as Densus mentioned, this may be an inappropriate use of Selenium; I'm not sure.

There are plenty of public, working Selenium-based Google image scrapers on GitHub that you can view and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most if not all of them will be Selenium implementations.

For example: https://github.com/Msalmannasir/Google_image_scraper

For this one, just download the Chrome driver (ChromeDriver) and update the file path to it in the code.
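To give a rough idea of what such a Selenium scraper looks like, here is a minimal Python 3 sketch. It is not the code from the linked repository. The function names, the plain "img" selector, and the filter on http(s) sources are all assumptions for illustration, and Google's markup may well need a different selector or some scrolling before the full-size URLs appear. The driver setup assumes Selenium 4.

```python
from urllib.parse import urlencode


def build_image_search_url(query):
    """Build the image-search URL that the question assembles by hand."""
    params = {"q": query, "source": "lnms", "tbm": "isch"}
    return "https://www.google.com/search?" + urlencode(params)


def collect_image_srcs(driver, limit=5):
    """Return up to `limit` http(s) src values from <img> tags on the loaded page."""
    srcs = []
    # "css selector" is the value of Selenium's By.CSS_SELECTOR constant.
    for img in driver.find_elements("css selector", "img"):
        if len(srcs) >= limit:
            break
        src = img.get_attribute("src")
        # Skip missing sources and inline data: URIs (assumed filter).
        if src and src.startswith("http"):
            srcs.append(src)
    return srcs


def scrape(query, driver_path, limit=5):
    """Open Chrome through Selenium, load the results page, and collect image URLs."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        driver.get(build_image_search_url(query))
        return collect_image_srcs(driver, limit)
    finally:
        driver.quit()
```

Calling scrape("your query", "path/to/chromedriver") would then return a list of URLs that can be downloaded the same way the question's loop does; the driver path here is a placeholder for wherever you saved the driver.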

foerever answered Oct 20 '22 00:10