Scrape 160,000 pages - too slow

Tags: python, urllib2

I have a file with more than 160,000 URLs, and I want to scrape some information from each of those pages. The script looks roughly like this:

import re
import urllib2

# urlfile is the open file with the 160,000 URLs, nf the output file (both opened elsewhere)
for line in urlfile:
    htmlfile = urllib2.urlopen(line)
    htmltext = htmlfile.read()
    regexName = '"></a>(.+?)</dd><dt>'
    patternName = re.compile(regexName)
    name = re.findall(patternName, htmltext)
    if name:
        text = name[0]
    else:
        text = 'unknown'

    nf.write(text)

This works, but it is very, very slow. It would take more than four days to scrape all 160,000 pages. Any suggestions for speeding it up?

asked Oct 22 '22 by ticktack

1 Answer

Some advice on your code:

When you compile the regex, make sure you actually use the compiled object (call its findall method rather than re.findall), and avoid recompiling the pattern in every iteration of your processing loop:

pattern = re.compile('"></a>(.+?)</dd><dt>')
# ...
links = pattern.findall(html)

If you want to avoid using other frameworks, the best way to speed things up is to use the standard threading library to run multiple HTTP connections in parallel.

Something like this:

from Queue import Queue
from threading import Thread

import urllib2
import re

# Bounded work queue that the URLs are pushed onto (at most 100 pending)
url_queue = Queue(100)
pattern = re.compile('"></a>(.+?)</dd><dt>')

def worker():
    '''Gets the next url from the queue and processes it'''
    while True:
        url = url_queue.get()
        print url
        html = urllib2.urlopen(url).read()
        print html[:10]
        links = pattern.findall(html)
        if len(links) > 0:
            print links
        url_queue.task_done()

# Start a pool of 20 workers
for i in xrange(20):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# Demo producer - replace this with a loop that reads your URL file
# and queues each line (see the sketch below)
for i in xrange(100):
    url_queue.put("http://www.ravn.co.uk")

# Block until everything is finished.
url_queue.join()   
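
To tie this back to the original problem, here is a minimal, self-contained sketch of how the URL file could be fed into the queue and how the workers could write their results without interleaving. It assumes the URLs live in urls.txt and the results go to names.txt (placeholder file names), uses a Lock to serialize writes from the 20 threads, and wraps the download in try/except so task_done() is still called when a request fails and url_queue.join() does not hang:

from Queue import Queue
from threading import Thread, Lock

import urllib2
import re

pattern = re.compile('"></a>(.+?)</dd><dt>')
url_queue = Queue(100)

# Shared output file plus a lock so the 20 workers do not interleave writes.
# 'urls.txt' and 'names.txt' are placeholder file names.
nf = open('names.txt', 'w')
write_lock = Lock()

def worker():
    '''Fetches the next URL, extracts the name and appends it to the output file'''
    while True:
        url = url_queue.get()
        try:
            html = urllib2.urlopen(url).read()
            names = pattern.findall(html)
            text = names[0] if names else 'unknown'
        except Exception:
            text = 'unknown'
        with write_lock:
            nf.write(text + '\n')
        url_queue.task_done()

# Start a pool of 20 workers
for i in xrange(20):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# Producer: push every URL from the input file onto the queue
for line in open('urls.txt'):
    url_queue.put(line.strip())

# Block until everything is finished
url_queue.join()
nf.close()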
answered Nov 03 '22 by JanRavn