I'm trying to write a Python script that crawls a website in parallel. I made a prototype that crawls to depth one. However, join() doesn't seem to be working and I can't figure out why.
Here's my code:
from threading import Thread
import Queue
import urllib2
import re
from BeautifulSoup import *
from urlparse import urljoin

def doWork():
    while True:
        try:
            myUrl = q_start.get(False)
        except:
            continue
        try:
            c = urllib2.urlopen(myUrl)
        except:
            continue
        soup = BeautifulSoup(c.read())
        links = soup('a')
        for link in links:
            if('href' in dict(link.attrs)):
                url = urljoin(myUrl, link['href'])
                if url.find("'") != -1: continue
                url = url.split('#')[0]
                if url[0:4] == 'http':
                    print url
                    q_new.put(url)

q_start = Queue.Queue()
q_new = Queue.Queue()

for i in range(20):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

q_start.put("http://google.com")
print "loading"
q_start.join()
print "end"
join() will block until task_done() has been called as many times as items have been enqueued. You never call task_done(), so join() blocks. In the code you provide, the right place to call it is at the very end of your doWork loop:
def doWork():
    while True:
        task = q_start.get(False)
        ...
        for subtask in processed(task):
            ...
        q_start.task_done()  # tell the producer we completed a task
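
As a rough sketch, here is one way your doWork could be restructured so that task_done() runs exactly once per item taken off q_start, even when urlopen fails. It keeps the question's Python 2 modules and queue names; swallowing fetch/parse errors with a broad except is an assumption that mirrors the original bare excepts, not a recommendation:

def doWork():
    while True:
        try:
            myUrl = q_start.get(False)   # non-blocking get, as in the question
        except Queue.Empty:
            continue                     # nothing was dequeued, so no task_done() needed
        try:
            c = urllib2.urlopen(myUrl)
            soup = BeautifulSoup(c.read())
            for link in soup('a'):
                if 'href' in dict(link.attrs):
                    url = urljoin(myUrl, link['href'])
                    if url.find("'") != -1:
                        continue
                    url = url.split('#')[0]
                    if url[0:4] == 'http':
                        q_new.put(url)
        except Exception:
            pass                         # swallow fetch/parse errors, as the original code did
        finally:
            q_start.task_done()          # always mark the dequeued item as done

With this in place, q_start.join() returns once the seed URL has been fully processed, and the depth-one links collected on q_new can then be fed back in for the next level.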