 

Python: fetch and download webpages with multiple threads


I want to batch-download webpages from one site. My 'urls.txt' file contains 5,000,000 URLs; it's about 300 MB. How can I fetch these URLs with multiple threads and download the pages? Or how else can I batch-download them?

My idea:

with open('urls.txt', 'r') as f:
    for line in f:
        url = line.strip()
        # fetch url here

Or should I use Twisted?

Is there a good solution for this?
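The loop above could be sketched with a bounded thread pool, for example via the standard library's `concurrent.futures` (a minimal sketch, assuming Python 3; the file name `urls.txt` comes from the question, and the worker count and timeout are illustrative, not tuned):

```python
# Sketch: fetch many URLs concurrently with a bounded thread pool.
# For 5,000,000 URLs you would want to stream the file in chunks rather
# than materialize all results at once; this shows only the core pattern.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Download one URL; return (url, body) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=30) as resp:
            return url, resp.read()
    except OSError:
        return url, None  # record the failure and keep going

def fetch_all(urls, max_workers=20):
    """Fetch URLs concurrently; the pool caps simultaneous connections."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

if __name__ == "__main__":
    with open('urls.txt') as f:
        urls = (line.strip() for line in f if line.strip())
        for url, body in fetch_all(urls):
            print(url, 'failed' if body is None else '%d bytes' % len(body))
```

Because the downloads are I/O-bound, threads work despite the GIL; the pool size mainly limits how hard you hit the remote server.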

bell007 asked Jan 25 '10 19:01
1 Answer

If this isn't part of a larger program, then notnoop's idea of using some existing tool to accomplish this is a pretty good one. If a shell loop invoking wget solves your problem, that'll be a lot easier than anything involving more custom software development.
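Such a shell loop might look like the following (a minimal sketch, assuming GNU `xargs` and `wget` are installed; the parallelism of 8 is an arbitrary choice):

```shell
# Feed urls.txt to wget, one URL per invocation (-n 1),
# with xargs running up to 8 downloads in parallel (-P 8).
< urls.txt xargs -n 1 -P 8 wget -q
```

This keeps a fixed number of downloads in flight without any custom code, which is often enough for a one-off batch job.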

However, if you need to fetch these resources as part of a larger program, then doing it with the shell may not be ideal. In this case, I strongly recommend Twisted, which makes it easy to issue many requests in parallel.

A few years ago I wrote up an example of how to do just this. Take a look at http://jcalderone.livejournal.com/24285.html.

Jean-Paul Calderone answered Oct 12 '22 12:10