I'm trying to learn Python scraping and came across a program that scrapes a set number of images from Google image search results.
I changed it to fetch 5 images. It worked for a while, but it recently stopped working and now reports that there are 0 images.
import os
import json
import urllib2
from bs4 import BeautifulSoup  # this import was missing

def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')

query = raw_input("query image")  # you can change the query for the image here
image_type = "ActiOn"
query = '+'.join(query.split())
url = "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your images here (raw string so the backslashes aren't escapes)
DIR = r"C:\Users\mynam\Desktop\WB"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

soup = get_soup(url, header)

ActualImages = []  # contains (link, type) pairs for the large original images
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)
    ActualImages.append((meta["ou"], meta["ity"]))

print "there are total", len(ActualImages), "images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split('+')[0])  # split on '+', since the query was joined with '+'
if not os.path.exists(DIR):
    os.mkdir(DIR)

# download the first 5 images
for i, (img, Type) in enumerate(ActualImages[0:5]):
    try:
        req = urllib2.Request(img, headers=header)  # pass the header dict directly, not nested in another dict
        raw_img = urllib2.urlopen(req).read()
        cntr = len([f for f in os.listdir(DIR) if image_type in f]) + 1
        print cntr
        ext = Type if len(Type) else "jpg"  # fall back to .jpg when no type is given
        f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + ext), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e
There are no error logs; the directory gets created but stays empty. The ActualImages list remains empty for some reason.
It seems that Google has recently removed the metadata from the image search results, i.e. you won't find rg_meta
in the HTML anymore. Therefore, soup.find_all("div", {"class": "rg_meta"})
will not return anything.
I haven't found a workaround for this. I believe Google made this change precisely to prevent scraping.
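You can confirm this yourself: fetch the results page and count the rg_meta divs. The helper below is a hypothetical sketch (it just needs beautifulsoup4); run it against a freshly downloaded Google Images page and the count comes back 0.

```python
# Count how many <div class="rg_meta"> blocks an HTML page contains.
# Hypothetical diagnostic helper, not part of the original script;
# requires beautifulsoup4.
from bs4 import BeautifulSoup

def count_rg_meta(html):
    """Return the number of <div class="rg_meta"> elements in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", {"class": "rg_meta"}))
```

On HTML saved from a current Google Images search, count_rg_meta returns 0, which is exactly why the question's loop never appends anything.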
I haven't seen anyone mention this. It's not an ideal solution, but if you want something simple that works and doesn't take much hassle to set up, you can use Selenium. Since Google seems to be intentionally preventing image scraping, as Densus mentioned, this may be an inappropriate use of Selenium; I'm not sure.
There are plenty of public, working Selenium-based Google image scrapers on GitHub that you can view and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most if not all of them will be Selenium implementations.
For example: https://github.com/Msalmannasir/Google_image_scraper
For this one, just download the Chromium driver and update the file path to it in the code.
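To give a rough idea of the Selenium approach, here is a minimal Python 3 sketch: drive a real browser, let Google's JavaScript render the results, then read image URLs from the DOM. It assumes selenium is installed and a Chrome/Chromium driver is on your PATH; the bare "img" selector is a simplification, and Google's markup changes often, so treat the selectors as guesses.

```python
def build_search_url(query):
    # Same URL scheme the original script used.
    return ("https://www.google.com/search?q="
            + "+".join(query.split())
            + "&source=lnms&tbm=isch")

def scrape_image_urls(query, limit=5):
    # Imported lazily so build_search_url works even without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # needs a matching chromedriver on PATH
    try:
        driver.get(build_search_url(query))
        # Grab rendered <img> elements and keep non-empty src attributes.
        thumbs = driver.find_elements(By.CSS_SELECTOR, "img")
        urls = [t.get_attribute("src") for t in thumbs]
        return [u for u in urls if u][:limit]
    finally:
        driver.quit()

if __name__ == "__main__":
    for url in scrape_image_urls("kittens"):
        print(url)
```

Note that this returns thumbnail URLs (often data: URIs), not the full-size originals the old rg_meta JSON exposed; getting the originals typically requires clicking each thumbnail, which the GitHub scrapers linked above handle.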