 

Difference between Python urllib.urlretrieve() and wget

I am trying to retrieve a 500mb file using Python, and I have a script which uses urllib.urlretrieve(). There seems to be some network problem between me and the download site, as this call consistently hangs and fails to complete. However, using wget to retrieve the file tends to work without problems. What is the difference between urlretrieve() and wget that could cause this difference?
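Roughly, the script does something like this (the URL and filename here are just placeholders):

    import urllib

    # urlretrieve exposes no timeout or retry options, so a stalled
    # connection can hang indefinitely.
    urllib.urlretrieve("http://example.com/big-file.bin", "big-file.bin")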

jrdioko asked May 05 '10 22:05

1 Answer

The answer is quite simple. Python's urllib and urllib2 are nowhere near as mature and robust as they could be. Even better than wget in my experience is cURL. I've written code that downloads gigabytes of files over HTTP with file sizes ranging from 50 KB to over 2 GB. To my knowledge, cURL is the most reliable piece of software on the planet right now for this task. I don't think Python, wget, or even most web browsers can match it in terms of correctness and robustness of implementation. On a modern enough Python, urllib2 used in exactly the right way can be made pretty reliable, but I still run a curl subprocess, and that is absolutely rock solid.
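A minimal sketch of that careful urllib2 usage (the URL, filename, and timeout value are placeholders, and the timeout argument requires Python 2.6+): pass an explicit timeout so a dead connection raises instead of hanging forever, and copy the response in chunks rather than reading it all at once.

    import shutil
    import urllib2

    url = "http://example.com/big-file.bin"
    # An explicit timeout makes a stalled socket fail fast instead of
    # hanging; copying in chunks avoids holding 500 MB in memory.
    response = urllib2.urlopen(url, timeout=30)
    try:
        with open("big-file.bin", "wb") as out:
            shutil.copyfileobj(response, out, 64 * 1024)
    finally:
        response.close()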

Another way to state this is that cURL does one thing only and it does it better than any other software because it has had much more development and refinement. Python's urllib2 is serviceable and convenient and works well enough for small to average workloads, but cURL is way ahead in terms of reliability.

Also, cURL has numerous options to tune its reliability behavior, including retry counts, timeout values, etc.
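For example, a curl subprocess with a few of those options might look like this (the URL, filename, and specific values are only illustrative):

    import subprocess

    # -f fails on HTTP errors, -L follows redirects, --retry and the
    # timeout flags bound how long a flaky transfer can stall, and
    # "-C -" resumes a partially downloaded file.
    subprocess.check_call([
        "curl", "-f", "-L",
        "--retry", "5",
        "--retry-delay", "10",
        "--connect-timeout", "30",
        "--max-time", "3600",
        "-C", "-",
        "-o", "big-file.bin",
        "http://example.com/big-file.bin",
    ])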

Peter Lyons answered Oct 03 '22 04:10