Newbie here. I wrote a simple script using urllib2 to go through Billboard.com and scrape the top song and artist for each week from 1958 to 2013. The problem is that it's very slow -- it would take a few hours to complete.
I'm wondering where the bottleneck is, and whether there's a way to scrape more efficiently with urllib2 or if I need to use a more sophisticated tool?
import re
import urllib2

array = []
url = 'http://www.billboard.com/charts/1958-08-09/hot-100'
date = ""
while date != '2013-07-13':
    response = urllib2.urlopen(url)
    htmlText = response.read()
    date = re.findall('\d\d\d\d-\d\d-\d\d', url)[0]
    song = re.findall('<h1>.*</h1>', htmlText)[0]
    song = song[4:-5]
    artist = re.findall('/artist.*</a>', htmlText)[1]
    artist = re.findall('>.*<', artist)[0]
    artist = artist[1:-1]
    nextWeek = re.findall('href.*>Next', htmlText)[0]
    nextWeek = nextWeek[5:-5]
    array.append([date, song, artist])
    url = 'http://www.billboard.com' + nextWeek
print array
Your bottleneck is almost certainly in getting the data from the website. There is latency for each network request, which blocks anything else from happening in the meantime. You should consider splitting up requests across multiple threads so that you can send more than one request at a time. Basically, your performance here is I/O-bound, not CPU-bound.
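To make the I/O-bound point concrete, here's a tiny self-contained sketch (Python 3; the network call is replaced by `time.sleep` to simulate latency, and all names and timings are illustrative, not Billboard-specific):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    # stand-in for urlopen(url).read(): ~0.1 s of pure waiting
    time.sleep(0.1)
    return 'html for %s' % url

urls = ['http://example.com/%d' % i for i in range(10)]

# one request after another: latencies add up (~1.0 s total)
start = time.time()
sequential = [fake_fetch(u) for u in urls]
seq_elapsed = time.time() - start

# ten requests in flight at once: latencies overlap (~0.1 s total)
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    threaded = list(pool.map(fake_fetch, urls))
thr_elapsed = time.time() - start

print('sequential: %.2fs, threaded: %.2fs' % (seq_elapsed, thr_elapsed))
```

Because the threads spend almost all their time waiting rather than computing, the wall-clock time shrinks by roughly the number of workers, which is exactly the property the crawler below exploits.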
Here's a simple solution built from the ground up so that you can see how crawlers generally work. Using something like Scrapy might be best in the long run, but I find it always helps to start with something simple and explicit.
import threading
import Queue
import time
import datetime
import urllib2
import re


class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue
        # let's use a threading event to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop
            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url)  # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html with the lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]
            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)
if __name__ == '__main__':
    # how many threads do you want? more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta * week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta * week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()
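On the "scraping with regex blows" comment in the crawler: a more robust approach is to parse the HTML first and then query it. The comment suggests lxml.html with XPath; as a dependency-free sketch of the same idea, here's the stdlib HTML parser pulling the h1 text (Python 3's html.parser; the markup snippet is simplified and illustrative, not the real chart page):

```python
from html.parser import HTMLParser  # the HTMLParser module in Python 2

class H1Extractor(HTMLParser):
    """Collects the text content of every <h1> element."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.h1_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # only keep text that appears between <h1> and </h1>
        if self.in_h1:
            self.h1_texts.append(data.strip())

# illustrative snippet standing in for a downloaded chart page
html = ('<div><h1>Poor Little Fool</h1>'
        '<p class="chart_info"><a href="/artist/ricky-nelson">Ricky Nelson</a></p></div>')
parser = H1Extractor()
parser.feed(html)
print(parser.h1_texts)  # ['Poor Little Fool']
```

Unlike the `song[4:-5]` slicing above, this keeps working if attributes are added to the tag or the markup spans several lines.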
Here's a solution using Scrapy. Take a look at the overview and you'll understand that it's the tool designed for this kind of task (though BeautifulSoup or lxml would work too). Here's a working spider that extracts everything you were asking for (it ran for 15 minutes on my rather old laptop):
import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        date = datetime.date(year=1958, month=8, day=9)

        self.start_urls = []
        while date.year < 2013:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip chart entries missing a title or artist link
                continue

            yield item
Save it as billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json, you'll see:
...
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"}
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"}
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
...
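Once you have the output, it's easy to work with: assuming one JSON object per line as in the excerpt above, you can load the records and group them however you like (the two inline records below are just copied from the sample output for illustration):

```python
import json

# stand-in for open('output.json') -- two lines from the sample output above
lines = '''{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}'''.splitlines()

# parse each line into a dict
rows = [json.loads(line) for line in lines]

# group song titles by chart date
by_date = {}
for row in rows:
    by_date.setdefault(row['date'], []).append(row['song'])

print(by_date)
```

From there, dumping the per-week #1 into a CSV or database is a few more lines.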
Also, take a look at grequests as an alternative tool.
Hope that helps.