I have a file with more than 160,000 URLs, and I want to scrape some information from each of those pages. The script looks roughly like this:
htmlfile = urllib2.urlopen(line)
htmltext = htmlfile.read()
regexName = '"></a>(.+?)</dd><dt>'
patternName = re.compile(regexName)
name = re.findall(patternName,htmltext)
if name:
    text = name[0]
else:
    text = 'unknown'
nf.write(text)
This works, but it is very, very slow; it would take more than four days to scrape all 160,000 pages. Any suggestions on how to speed it up?
Some advice on your code:
When you compile a regex pattern, make sure you also use the compiled object (call findall on it directly rather than passing it back to re.findall), and avoid re-compiling the regex in each iteration of the processing loop. Compile it once up front:
pattern = re.compile('"></a>(.+?)</dd><dt>')
# ...
links = pattern.findall(html)
If you want to avoid pulling in other frameworks, the best way to speed things up is to use the standard threading library to run multiple HTTP connections in parallel.
Something like this:
from Queue import Queue
from threading import Thread
import urllib2
import re
# Work queue where you push the URLs onto - bounded at 10 entries
url_queue = Queue(10)
pattern = re.compile('"></a>(.+?)</dd><dt>')
def worker():
    '''Gets the next url from the queue and processes it'''
    while True:
        url = url_queue.get()
        print url
        try:
            html = urllib2.urlopen(url).read()
            print html[:10]
            links = pattern.findall(html)
            if len(links) > 0:
                print links
        except Exception as e:
            # Keep the worker alive if a single URL fails
            print 'failed to process %s: %s' % (url, e)
        finally:
            # Always mark the task as done so url_queue.join() can return
            url_queue.task_done()
# Start a pool of 20 workers
for i in xrange(20):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
# Change this to read your links and queue them for processing
for url in xrange(100):
    url_queue.put("http://www.ravn.co.uk")
# Block until everything is finished.
url_queue.join()
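To tie this back to your original script, here is a minimal sketch of the same worker-pool pattern wired up to your file of URLs, writing the extracted name (or 'unknown') for each page. The filenames urls.txt and names.txt are placeholders for whatever you actually use, and the pool size and queue bound are just starting points to tune:
from Queue import Queue
from threading import Thread
import urllib2
import re

url_queue = Queue(100)   # URLs waiting to be fetched
name_queue = Queue()     # extracted names waiting to be written
pattern = re.compile('"></a>(.+?)</dd><dt>')

def worker():
    while True:
        url = url_queue.get()
        try:
            html = urllib2.urlopen(url).read()
            names = pattern.findall(html)
            name_queue.put(names[0] if names else 'unknown')
        except Exception:
            # A failed download still produces a placeholder result
            name_queue.put('unknown')
        finally:
            url_queue.task_done()

# Start a pool of 20 worker threads
for i in xrange(20):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# 'urls.txt' stands in for your file of 160,000 URLs, one per line
with open('urls.txt') as f:
    for line in f:
        url_queue.put(line.strip())

url_queue.join()

# Drain the results in the main thread so only one thread touches the output file
with open('names.txt', 'w') as nf:
    while not name_queue.empty():
        nf.write(name_queue.get() + '\n')
Because only the main thread writes to the output file, there is no need to lock around nf.write(). The trade-off is that names come back in completion order rather than input order, so push (url, name) tuples onto the result queue if you need to match them up afterwards.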