I want to batch-download webpages from a single site. My 'urls.txt' file contains 5,000,000 URLs and is about 300 MB. How can I use multiple threads to fetch these URLs and download the pages? Or is there another way to batch-download them?
My ideas:
    with open('urls.txt', 'r') as f:
        for el in f:
            url = el.strip()
            # fetch this URL
Or use Twisted?
Is there a good solution for it?
If this isn't part of a larger program, then notnoop's idea of using some existing tool to accomplish this is a pretty good one. If a shell loop invoking wget solves your problem, that'll be a lot easier than anything involving more custom software development.
However, if you need to fetch these resources as part of a larger program, then doing it from a shell script may not be ideal. In that case, I'd strongly recommend Twisted, which makes it easy to run many requests in parallel.
A few years ago I wrote up an example of how to do just this. Take a look at http://jcalderone.livejournal.com/24285.html.
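Since the question asks about threads, here is a rough sketch of the same "many requests in parallel" idea using only the standard library's thread pool. To be clear, this is not Twisted and not the code from the linked article. The names fetch, fetch_all, read_urls and main, the 20-worker limit, and the 30-second timeout are all made up for illustration. It reads urls.txt lazily and keeps only a bounded number of requests in flight, so the whole 300 MB file is never queued at once.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one page and return its body as bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def read_urls(path):
    """Yield non-empty lines from the URL file one at a time."""
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def fetch_all(urls, fetch_one=fetch, workers=20):
    """Yield (url, body, error) tuples, with at most `workers` requests in flight."""
    urls = iter(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit only the first batch; further URLs are pulled as slots free up.
        pending = {pool.submit(fetch_one, u): u for u in islice(urls, workers)}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                url = pending.pop(future)
                try:
                    yield url, future.result(), None
                except Exception as exc:  # report the failure and keep going
                    yield url, None, exc
                # Refill the slot that just finished, if any URLs remain.
                for next_url in islice(urls, 1):
                    pending[pool.submit(fetch_one, next_url)] = next_url


def main():
    # Assumed usage: report each result; a real program would save the body.
    for url, body, error in fetch_all(read_urls("urls.txt")):
        if error is None:
            print(url, len(body))
        else:
            print(url, "failed:", error)
```

Calling main() would start the download. Results come back in completion order rather than file order, and because all the URLs point at one site, a small worker count is probably kinder to the server than a large one.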