Newbie here. I wrote a simple script using urllib2 to go through Billboard.com and scrape the top song and artist for each week from 1958 to 2013. The problem is that it's very slow -- it would take a few hours to complete.
I'm wondering where the bottleneck is, and whether there's a way to scrape more efficiently with urllib2 or if I need to use a more sophisticated tool?
import re
import urllib2

array = []
url = 'http://www.billboard.com/charts/1958-08-09/hot-100'
date = ""
while date != '2013-07-13':
    response = urllib2.urlopen(url)
    htmlText = response.read()
    date = re.findall('\d\d\d\d-\d\d-\d\d', url)[0]
    song = re.findall('<h1>.*</h1>', htmlText)[0]
    song = song[4:-5]
    artist = re.findall('/artist.*</a>', htmlText)[1]
    artist = re.findall('>.*<', artist)[0]
    artist = artist[1:-1]
    nextWeek = re.findall('href.*>Next', htmlText)[0]
    nextWeek = nextWeek[5:-5]
    array.append([date, song, artist])
    url = 'http://www.billboard.com' + nextWeek
print array
Your bottleneck is almost certainly in getting the data from the website. There is latency for each network request, which blocks anything else from happening in the meantime. You should consider splitting up requests across multiple threads so that you can send more than one request at a time. Basically, your performance here is I/O-bound, not CPU-bound.
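To make the I/O-bound point concrete, here's a tiny self-contained sketch (Python 3; the network call is replaced by `time.sleep` to simulate latency, and all names and timings are illustrative, not Billboard-specific):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    # stand-in for urlopen(url).read(): ~0.1 s of pure waiting
    time.sleep(0.1)
    return 'html for %s' % url

urls = ['http://example.com/%d' % i for i in range(10)]

# one request after another: latencies add up (~1.0 s total)
start = time.time()
sequential = [fake_fetch(u) for u in urls]
seq_elapsed = time.time() - start

# ten requests in flight at once: latencies overlap (~0.1 s total)
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    threaded = list(pool.map(fake_fetch, urls))
thr_elapsed = time.time() - start

print('sequential: %.2fs, threaded: %.2fs' % (seq_elapsed, thr_elapsed))
```

Because the threads spend almost all their time waiting rather than computing, the wall-clock time shrinks by roughly the number of workers, which is exactly the property the crawler below exploits.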
Here's a simple solution built from the ground up so that you can see how crawlers generally work. Using something like Scrapy might be best in the long run, but I find it always helps to start with something simple and explicit.
import threading
import Queue
import time
import datetime
import urllib2
import re


class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue
        # let's use a threading event to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop
            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url)  # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html with the lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]
            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)
if __name__ == '__main__':
    # how many threads do you want? more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta * week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta * week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()
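On the "scraping with regex blows" comment in the crawler: a more robust approach is to parse the HTML first and then query it. The comment suggests lxml.html with XPath; as a dependency-free sketch of the same idea, here's the stdlib HTML parser pulling the h1 text (Python 3's html.parser; the markup snippet is simplified and illustrative, not the real chart page):

```python
from html.parser import HTMLParser  # the HTMLParser module in Python 2

class H1Extractor(HTMLParser):
    """Collects the text content of every <h1> element."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.h1_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # only keep text that appears between <h1> and </h1>
        if self.in_h1:
            self.h1_texts.append(data.strip())

# illustrative snippet standing in for a downloaded chart page
html = ('<div><h1>Poor Little Fool</h1>'
        '<p class="chart_info"><a href="/artist/ricky-nelson">Ricky Nelson</a></p></div>')
parser = H1Extractor()
parser.feed(html)
print(parser.h1_texts)  # ['Poor Little Fool']
```

Unlike the `song[4:-5]` slicing above, this keeps working if attributes are added to the tag or the markup spans several lines.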
Here's a solution using Scrapy. Take a look at the overview and you'll understand that it's the tool designed for this kind of task (though BeautifulSoup or lxml would work too). Here's a working spider that extracts everything you were asking for (it ran for 15 minutes on my rather old laptop):
import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        date = datetime.date(year=1958, month=8, day=9)

        self.start_urls = []
        while date.year < 2013:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip chart entries missing a title or artist link
                continue

            yield item
Save it as billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json, you'll see:
...
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"}
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"}
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
...
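Once you have the output, it's easy to work with: assuming one JSON object per line as in the excerpt above, you can load the records and group them however you like (the two inline records below are just copied from the sample output for illustration):

```python
import json

# stand-in for open('output.json') -- two lines from the sample output above
lines = '''{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}'''.splitlines()

# parse each line into a dict
rows = [json.loads(line) for line in lines]

# group song titles by chart date
by_date = {}
for row in rows:
    by_date.setdefault(row['date'], []).append(row['song'])

print(by_date)
```

From there, dumping the per-week #1 into a CSV or database is a few more lines.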
Also, take a look at grequests as an alternative tool.
Hope that helps.