
Multi-threaded web requests in Python -- 'Name or service not known'

I have a large scraping job to do -- most of the script's time is spent blocking on network latency. I'm trying to multi-thread the script so I can make multiple requests simultaneously, but about 10% of my threads die with the following error:

URLError: <urlopen error [Errno -2] Name or service not known>

The other 90% complete successfully. I am requesting multiple pages from the same domain, so it seems like there may be some DNS issue. I make 25 requests at a time (25 threads). Everything works fine if I limit myself to 5 requests at a time, but once I get to around 10 concurrent requests, I start seeing this error intermittently.
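For reference, capping concurrency can be sketched with a semaphore; `fetch()` below is a simplified stand-in for the real urlopen-based worker (no actual network call is made), and the limit of 5 reflects the level that worked reliably:

```python
# Sketch: cap the number of simultaneous requests with a semaphore.
# fetch() is a hypothetical stand-in for the real request worker.
import threading

MAX_CONCURRENT = 5  # assumption: 5 concurrent requests worked reliably
semaphore = threading.Semaphore(MAX_CONCURRENT)

def fetch(url, results, index):
    with semaphore:  # at most MAX_CONCURRENT threads run this body at once
        # real code would call urllib.request.urlopen(url) here
        results[index] = "fetched:" + url

urls = ["http://example.com/page%d" % i for i in range(25)]
results = [None] * len(urls)
threads = [threading.Thread(target=fetch, args=(u, results, i))
           for i, u in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This still starts 25 threads, but only 5 at a time are ever inside the request section, which keeps the resolver load at the level that was observed to work.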

I have read Repeated host lookups failing in urllib2, which describes the same issue I have, and followed the suggestions there, but to no avail.

I have also tried using the multiprocessing module instead of multi-threading and get the same behaviour -- about 10% of the processes die with the same error -- which leads me to believe this is not an issue with urllib2 but something else.

Can someone explain what is going on and suggest how to fix it?

UPDATE

If I manually hard-code the IP address of the site into my script, everything works perfectly, so this error happens sometime during the DNS lookup.
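A sketch of doing that resolution once up front rather than hard-coding: resolve the hostname a single time and substitute the IP into each URL, keeping the original hostname aside for the Host header so virtual hosting still works (the URL below is a placeholder, not the real site):

```python
# Sketch: resolve the hostname once and pin the IP into the URL,
# so later requests skip the per-request DNS lookup entirely.
import socket
from urllib.parse import urlsplit, urlunsplit

def pin_host(url):
    parts = urlsplit(url)
    ip = socket.gethostbyname(parts.hostname)  # single DNS lookup
    netloc = ip if parts.port is None else "%s:%d" % (ip, parts.port)
    pinned = urlunsplit(parts._replace(netloc=netloc))
    return pinned, parts.hostname  # hostname goes in the Host header

pinned_url, host_header = pin_host("http://localhost/page1")
# real code would then build the request as, e.g.:
# urllib.request.Request(pinned_url, headers={"Host": host_header})
```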

Jesse Cohen asked Feb 12 '11
1 Answer

Suggestion: Try enabling a DNS cache in your system, such as nscd. This should eliminate DNS lookup problems if your scraper always makes requests to the same domain.

Make sure that the file objects returned by urllib2.urlopen are properly closed after being read, in order to free resources. Otherwise, you may hit your system's limit on open sockets.
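A sketch of closing the response deterministically with `contextlib.closing` (in Python 3 the response object is also a context manager itself); a local `file://` URL stands in for the real site so the example runs without network access:

```python
# Sketch: always close the response object so its socket is released
# immediately, instead of waiting for the garbage collector.
import contextlib
import tempfile
import urllib.request

# Create a small local file to fetch; file:// keeps this offline-safe.
with tempfile.NamedTemporaryFile(mode="w", suffix=".html",
                                 delete=False) as f:
    f.write("<html>hello</html>")
    path = f.name

with contextlib.closing(urllib.request.urlopen("file://" + path)) as resp:
    body = resp.read().decode()
# resp is closed here, and its underlying descriptor is freed.
```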

Also, take into account the politeness policy web crawlers should follow, to avoid overloading a server with too many requests.
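One way to sketch such a politeness policy is a shared limiter that enforces a minimum interval between requests to the host; the interval and the loop below are illustrative values, not tuned recommendations:

```python
# Sketch: serialize a minimum delay between requests so concurrent
# threads cannot flood the server (or its resolver) all at once.
import threading
import time

class PoliteLimiter:
    def __init__(self, min_interval=0.05):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last = 0.0

    def wait(self):
        # Block until min_interval has passed since the previous request.
        with self._lock:
            now = time.monotonic()
            delay = self._last + self.min_interval - now
            if delay > 0:
                time.sleep(delay)
            self._last = time.monotonic()

limiter = PoliteLimiter(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # real code would issue one request here
elapsed = time.monotonic() - start
```

Each worker thread calls `limiter.wait()` before its request, so total request rate stays bounded no matter how many threads are running.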

scoffey answered Oct 26 '22