 

How to speed up web scraping in python

I'm working on a project for school and I am trying to get data about movies. I've managed to write a script to get the data I need from IMDbPY and Open Movie DB API (omdbapi.com). The challenge I'm experiencing is that I'm trying to get data for 22,305 movies and each request takes about 0.7 seconds. Essentially my current script will take about 8 hours to complete. Looking for any way to maybe use multiple requests at the same time or any other suggestions to significantly speed up the process of getting this data.

import urllib2
import json
import pandas as pd
import time
import imdb

start_time = time.time() #record time at beginning of script

#used to make imdb.com think we are getting this data from a browser
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }

#Open Movie Database Query url for IMDb IDs
url = 'http://www.omdbapi.com/?tomatoes=true&i='

#read the ids from the imdb_id csv file
imdb_ids = pd.read_csv('ids.csv')

cols = [u'Plot', u'Rated', u'tomatoImage', u'Title', u'DVD', u'tomatoMeter',
 u'Writer', u'tomatoUserRating', u'Production', u'Actors', u'tomatoFresh',
 u'Type', u'imdbVotes', u'Website', u'tomatoConsensus', u'Poster', u'tomatoRotten',
 u'Director', u'Released', u'tomatoUserReviews', u'Awards', u'Genre', u'tomatoUserMeter',
 u'imdbRating', u'Language', u'Country', u'imdbpy_budget', u'BoxOffice', u'Runtime',
 u'tomatoReviews', u'imdbID', u'Metascore', u'Response', u'tomatoRating', u'Year',
 u'imdbpy_gross']

#create movies dataframe
movies = pd.DataFrame(columns=cols)

for i in range(len(imdb_ids)):  #iterate over every id (len(imdb_ids)-1 would skip the last row)

    start = time.time()
    req = urllib2.Request(url + str(imdb_ids.ix[i,0]), None, headers) #request page
    response = urllib2.urlopen(req) #actually call the html request
    the_page = response.read() #read the json from the omdbapi query
    movie_json = json.loads(the_page) #convert the json to a dict

    #get the gross revenue and budget from IMDbPy
    data = imdb.IMDb()
    movie_id = imdb_ids.ix[i,['imdb_id']]
    movie_id = movie_id.to_string()
    movie_id = int(movie_id[-7:])
    data = data.get_movie_business(movie_id)
    data = data['data']
    data = data['business']

    #get the budget $ amount out of the budget IMDbPy string
    try:
        budget = data['budget'][0]
        budget = budget.replace('$', '').replace(',', '')
        budget = str(budget.split(' ')[0])
    except (KeyError, IndexError):
        budget = 0  #no budget data for this movie

    #get the gross $ amount out of the gross IMDbPy string
    try:
        gross = data['gross'][0]
        gross = gross.replace('$', '').replace(',', '')
        gross = str(gross.split(' ')[0])
    except (KeyError, IndexError):
        gross = 0  #no gross data for this movie

    #add gross and budget to the movie dict (0 when IMDbPy had no data)
    movie_json[u'imdbpy_gross'] = gross
    movie_json[u'imdbpy_budget'] = budget

    #create new dataframe that can be merged to movies DF    
    tempDF = pd.DataFrame.from_dict(movie_json, orient='index')
    tempDF = tempDF.T

    #add the new movie to the movies dataframe
    movies = movies.append(tempDF, ignore_index=True)
    end = time.time()
    time_took = round(end-start, 2)
    percentage = round(((i+1) / float(len(imdb_ids))) * 100,1)
    print i+1,"of",len(imdb_ids),"(" + str(percentage)+'%)','completed',time_took,'sec'

#save the dataframe to a csv file            
movies.to_csv('movie_data.csv', index=False)
end_time = time.time()
print round((end_time-start_time)/60,1), "min"
asked May 04 '14 by nbitting




1 Answer

Use the eventlet library to fetch concurrently

As advised in the comments, you should fetch your URLs concurrently. This can be done with threading, with multiprocessing, or with eventlet; a minimal thread-pool sketch follows, and the rest of this answer focuses on eventlet.
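For completeness, here is a minimal sketch of the thread-pool route using multiprocessing.dummy (a thread-backed Pool from the standard library); the URLs and the pool size of 20 are just illustrative placeholders:

import urllib2
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

def fetch(url):
    # one blocking request per worker thread
    return urllib2.urlopen(url).read()

# placeholder URLs; in the real script these would be the OMDb query URLs
urls = ['http://www.omdbapi.com/?tomatoes=true&i=tt0000001',
        'http://www.omdbapi.com/?tomatoes=true&i=tt0000002']

pool = Pool(20)                # 20 concurrent workers; tune to what the API tolerates
pages = pool.map(fetch, urls)  # results come back in the same order as urls
pool.close()
pool.join()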

Install eventlet

$ pip install eventlet

Try the web crawler sample from eventlet

See: http://eventlet.net/doc/examples.html#web-crawler
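In case the link changes, the crawler example boils down to roughly the following sketch (the pool size of 200 and the URLs are illustrative):

import eventlet
from eventlet.green import urllib2   # cooperative, non-blocking drop-in for urllib2

urls = ['http://www.omdbapi.com/?tomatoes=true&i=tt0000001',
        'http://www.omdbapi.com/?tomatoes=true&i=tt0000002']   # placeholders

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool(200)            # up to 200 concurrent green threads
for body in pool.imap(fetch, urls):
    print "got body of length", len(body)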

Understanding concurrency with eventlet

With threading, the system takes care of switching between your threads. This becomes a big problem once the threads access shared data structures, because you never know which other thread is currently touching your data. You then start playing with synchronized blocks, locks and semaphores, just to serialize access to the shared data.

With eventlet it is much simpler: only one green thread runs at a time, and you switch between them only at I/O instructions or at other eventlet calls. The rest of your code runs uninterrupted, without the risk of another thread messing with your data (illustrated in the sketch below).
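A tiny illustrative sketch of this (the squaring worker is made up; it just stands in for real I/O-bound work):

import eventlet

results = {}                       # shared dict, no lock needed

def worker(n):
    eventlet.sleep(0)              # a yield point; real code would block on I/O here
    results[n] = n * n             # safe: only one green thread runs at any moment

pool = eventlet.GreenPool()
list(pool.imap(worker, range(5)))  # run all jobs to completion
print results                      # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}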

You only have to take care of the following:

  • all I/O operations must be non-blocking (this is mostly easy; eventlet provides non-blocking versions of most of the I/O you need, see the sketch after this list).

  • your remaining code must not be CPU-expensive, as that would block switching between "green" threads for too long and the benefit of "green" multithreading would be lost.
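The usual ways to get those non-blocking versions are either importing the green modules explicitly (as in the crawler sketch above) or patching the standard library once at start-up; a minimal sketch:

import eventlet
eventlet.monkey_patch()   # replace blocking stdlib I/O (socket, time, select, ...) with cooperative versions

import urllib2            # after the patch, urllib2's sockets yield to other green threads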

The great advantage of eventlet is that it lets you write code in a straightforward way without cluttering it (too) much with locks, semaphores, etc.

Apply eventlet to your code

If I understand it correctly, the list of URLs to fetch is known in advance and the order in which they are processed does not matter for your analysis. That should allow an almost direct copy of the eventlet example. I see that the index i has some significance, so you might consider combining each URL and its index into a tuple and processing them as independent jobs, as sketched below.
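A sketch of that idea, reusing url, headers and imdb_ids from the question (the pool size of 50 and the merge step are placeholders; the OMDb API may also throttle you at this rate):

import json
import eventlet
from eventlet.green import urllib2       # cooperative urllib2

def fetch_one(job):
    i, imdb_id = job                     # keep the row index together with its result
    req = urllib2.Request(url + str(imdb_id), None, headers)
    return i, json.loads(urllib2.urlopen(req).read())

jobs = [(i, imdb_ids.ix[i, 0]) for i in range(len(imdb_ids))]
pool = eventlet.GreenPool(50)            # 50 concurrent OMDb requests
for i, movie_json in pool.imap(fetch_one, jobs):
    pass                                 # merge movie_json into the dataframe at row i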

There are definitely other methods, but personally I have found eventlet really easy to use compared to other techniques, while getting really good results (especially with fetching feeds). You just have to grasp the main concepts and be a bit careful to follow eventlet's requirements (keep everything non-blocking).

Fetching urls using requests and eventlet - erequests

There are various packages for asynchronous processing with requests; one of them uses eventlet and is named erequests, see https://github.com/saghul/erequests

Simple sample fetching a set of URLs

import erequests

# have list of urls to fetch
urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]
# erequests.async.get(url) creates asynchronous request
async_reqs = [erequests.async.get(url) for url in urls]
# each async request is ready to go, but not yet performed

# erequests.map performs the asynchronous requests
# and yields each processed request `req` as it completes
for req in erequests.map(async_reqs):
    if req.ok:
        content = req.content
        # process it here
        print "processing data from:", req.url

Problems when processing this specific question

We are able to fetch and somehow process all the URLs we need. But in this question, processing is bound to a particular record in the source data, so we need to match each processed request with the index of the record it belongs to in order to get the further details needed for the final processing.

As we will see later, asynchronous processing does not honour the order of the requests: some are processed sooner, some later, and map yields whatever has completed.

One option is to attach the index of each URL to its request and use it later when processing the returned data.

Complex sample of fetching and processing URLs while preserving URL indices

Note: the following sample is rather complex; if you can live with the solution provided above, skip this. But make sure you are not running into the problems detected and resolved below (URLs being modified, requests following redirects).

import erequests
from itertools import count, izip
from functools import partial

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

def print_url_index(index, req, *args, **kwargs):
    content_length = req.headers.get("content-length", None)
    todo = "PROCESS" if req.status_code == 200 else "WAIT, NOT YET READY"
    print "{todo}: index: {index}: status: {req.status_code}: length: {content_length}, {req.url}".format(**locals())

async_reqs = (erequests.async.get(url, hooks={"response": partial(print_url_index, i)}) for i, url in izip(count(), urls))

for req in erequests.map(async_reqs):
    pass

Attaching hooks to request

requests (and erequests too) allows defining hooks for the event called response. Each time a request gets a response, the hook function is called and can inspect or even modify the response.

The following line attaches a hook to the response event:

erequests.async.get(url, hooks={"response": partial(print_url_index, i)})

Passing url index to the hook function

The signature of any hook shall be func(req, *args, **kwargs)

But we also need to pass the index of the URL being processed into the hook function.

For this purpose we use functools.partial, which creates a simplified function by fixing some of the parameters to specific values. This is exactly what we need: looking at the print_url_index signature, we only have to fix the value of index, and the rest fits the requirements for a hook function.

In our call we use partial with the function print_url_index, providing each URL's unique index (a tiny illustration follows).
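As a quick aside, this is all partial does here (the index 7 and the fake response string are made-up values):

from functools import partial

def print_url_index(index, req, *args, **kwargs):
    print index, req

hook = partial(print_url_index, 7)   # index is now fixed to 7
hook("fake response")                # behaves like print_url_index(7, "fake response")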

The index could be provided in the loop by enumerate; with a larger number of items we can work in a more memory-efficient way and use count, which generates an incremented number on each call, starting by default from 0.

Let us run it:

$ python ereq.py
WAIT, NOT YET READY: index: 3: status: 301: length: 66, http://python-requests.org/
WAIT, NOT YET READY: index: 4: status: 301: length: 58, http://kennethreitz.com/
WAIT, NOT YET READY: index: 0: status: 301: length: None, http://www.heroku.com/
PROCESS: index: 2: status: 200: length: 7700, http://httpbin.org/
WAIT, NOT YET READY: index: 1: status: 301: length: 64, http://python-tablib.org/
WAIT, NOT YET READY: index: 4: status: 301: length: None, http://kennethreitz.org
WAIT, NOT YET READY: index: 3: status: 302: length: 0, http://docs.python-requests.org
WAIT, NOT YET READY: index: 1: status: 302: length: 0, http://docs.python-tablib.org
PROCESS: index: 3: status: 200: length: None, http://docs.python-requests.org/en/latest/
PROCESS: index: 1: status: 200: length: None, http://docs.python-tablib.org/en/latest/
PROCESS: index: 0: status: 200: length: 12064, https://www.heroku.com/
PROCESS: index: 4: status: 200: length: 10478, http://www.kennethreitz.org/

This shows that:

  • requests are not processed in the order they were generated
  • some requests follow a redirect, so the hook function is called multiple times for them
  • carefully inspecting the URL values, we can see that no URL from the original urls list is reported by a response; even for index 2 we got an extra / appended. That is why a simple lookup of the response URL in the original list of urls would not help us; the sketch below therefore keys the results by the attached index instead.
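Putting that together, a sketch of how the hook could collect results keyed by the attached index instead of by URL (the results dict and store_result are just one possible way to do it):

import erequests
from functools import partial

results = {}                          # index -> response body

def store_result(index, req, *args, **kwargs):
    if req.status_code == 200:        # ignore intermediate redirect responses
        results[index] = req.content  # keyed by original position, not by url

async_reqs = (erequests.async.get(u, hooks={"response": partial(store_result, i)})
              for i, u in enumerate(urls))
for _ in erequests.map(async_reqs):
    pass
# results[i] now holds the body for urls[i], regardless of completion order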
answered Sep 18 '22 by Jan Vlcinsky