I have a large scraping job to do -- most of the script's time is spent blocked on network latency. I'm trying to multi-thread the script so I can make multiple requests simultaneously, but about 10% of my threads die with the following error:
URLError: <urlopen error [Errno -2] Name or service not known>
The other 90% complete successfully. I am requesting multiple pages from the same domain, so it seems like there may be some DNS issue. I make 25 requests at a time (25 threads). Everything works fine if I limit myself to 5 requests at a time, but once I get to around 10 concurrent requests, I start seeing this error intermittently.
I have read "Repeated host lookups failing in urllib2", which describes the same issue I have, and followed the suggestions therein, but to no avail.
I have also tried using the multiprocessing module instead of multi-threading, and I get the same behaviour -- about 10% of the processes die with the same error -- which leads me to believe this is not an issue with urllib2 but something else.
Can someone explain what is going on and suggest how to fix it?
UPDATE
If I hard-code the IP address of the site into my script, everything works perfectly, so this error happens sometime during the DNS lookup.
Suggestion: Try enabling a DNS cache in your system, such as nscd. This should eliminate DNS lookup problems if your scraper always makes requests to the same domain.
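If installing nscd is not an option, a similar effect can be approximated inside the script itself. The following is only a minimal sketch, not something taken from urllib2: the names make_cached_resolver and install_dns_cache are hypothetical, and it assumes that remembering a lookup forever (no TTL, no expiry) is acceptable for the lifetime of the scraping job. Failed lookups are not cached, so a transient failure still raises.

```python
import socket
import threading


def make_cached_resolver(resolver):
    """Wrap a getaddrinfo-style callable so repeated lookups hit a dict."""
    cache = {}
    lock = threading.Lock()

    def cached(host, port, *args, **kwargs):
        # The key covers every argument, so lookups with different
        # family/type flags do not share an entry.
        key = (host, port, args, tuple(sorted(kwargs.items())))
        with lock:
            if key in cache:
                return cache[key]
        # Resolve outside the lock so one slow lookup does not block others.
        result = resolver(host, port, *args, **kwargs)
        with lock:
            cache[key] = result
        return result

    return cached


def install_dns_cache():
    """Assumption: patching socket.getaddrinfo process-wide is acceptable."""
    socket.getaddrinfo = make_cached_resolver(socket.getaddrinfo)
```

Calling install_dns_cache() once, before starting the worker threads, means only the first request for each host should reach the system resolver. With multiprocessing, each process would have its own cache.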
Make sure that the file objects returned by urllib2.urlopen are properly closed after being read, in order to free resources. Otherwise, you may reach the limit of max open sockets in your system.
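One way to guarantee the close is contextlib.closing, which calls close() even if read() raises. This is a small sketch: the helper name fetch, the 30-second default timeout, and the opener parameter (there only so the function can be exercised without a network) are illustrative assumptions. The import fallback lets the same code run on Python 3, where urlopen lives in urllib.request.

```python
from contextlib import closing

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3


def fetch(url, timeout=30, opener=urlopen):
    """Read the whole response body and always close the connection."""
    # closing() calls response.close() on exit, including on exceptions.
    with closing(opener(url, timeout=timeout)) as response:
        return response.read()
```

Each worker thread would then call fetch(url) instead of calling urlopen directly, so no response object outlives the request that created it.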
Also, take into account the politeness policy web crawlers should have to avoid overloading a server with multiple requests.
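As a rough illustration of such a policy, the sketch below caps the number of requests in flight and spaces out request starts. The class name PoliteLimiter and the default values (5 concurrent requests, 0.2 seconds between starts) are assumptions chosen for the example, not recommended settings; the right numbers depend on the target site.

```python
import threading
import time


class PoliteLimiter(object):
    """Cap concurrent requests and enforce a minimum gap between starts."""

    def __init__(self, max_concurrent=5, min_interval=0.2):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_start = 0.0

    def __enter__(self):
        # Block until one of the max_concurrent slots is free.
        self._slots.acquire()
        with self._lock:
            now = time.time()
            # Reserve the next free start time for this request.
            start = max(now, self._next_start)
            self._next_start = start + self._min_interval
        if start > now:
            time.sleep(start - now)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._slots.release()
        return False
```

A single limiter shared by all threads could wrap each request, for example `with limiter: data = urllib2.urlopen(url).read()`. Since the question reports no errors at 5 simultaneous requests, a cap near that number may also sidestep the DNS failures, though that is a guess rather than a confirmed fix.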