When downloading with Python, should I use multithreading or multiprocessing?

Recently I've been working on a program that downloads manga from an online manga website. It works, but it's a bit slow, so I've decided to use multithreading/multiprocessing to speed up the downloading. Here are my questions:

  1. Which one is better? (This is a Python 3 program.)

  2. Multiprocessing, I think, will definitely work. If I use multiprocessing, what is a suitable number of processes? Does it relate to the number of cores in my CPU?

  3. Multithreading will probably work. The downloading obviously spends a lot of time waiting for pictures to arrive, so I think that when one thread starts waiting, Python will let another thread run. Am I correct?
    I've read "Inside the New GIL" by David M. Beazley. What's the influence of the GIL if I use multithreading?

asked Mar 12 '13 by laike9m


2 Answers

You're probably going to be bound by either the server's upload pipe (if you have a faster connection) or your download pipe (if you have a slower connection).

There's significant startup latency associated with TCP connections. To avoid this, HTTP servers can recycle connections for requesting multiple resources. So there are two ways for your client to avoid this latency hit:

(a) Download several resources over a single TCP connection, so your program only suffers the latency once, when downloading the first file.

(b) Download a single resource per TCP connection, and use multiple connections so that, hopefully, at every point in time at least one of them will be downloading at full speed.

With option (a), you want to look into how to reuse connections with whatever HTTP library you're using. Any good one will have a way to recycle connections. http://python-requests.org/ is a good Python HTTP library.
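
If you go that route, here's a minimal sketch of option (a) using a requests Session, which keeps the TCP connection to the host alive (HTTP keep-alive) so the setup latency is only paid once. The URLs and output file names below are made-up placeholders for illustration:

    import requests

    # Placeholder URLs; in the real program these would be the manga page images.
    urls = [
        "http://example.com/manga/page1.jpg",
        "http://example.com/manga/page2.jpg",
    ]

    # A Session reuses the underlying TCP connection (HTTP keep-alive) across
    # requests to the same host, so connection setup happens only once.
    with requests.Session() as session:
        for i, url in enumerate(urls, start=1):
            response = session.get(url, timeout=30)
            response.raise_for_status()
            with open("page_%d.jpg" % i, "wb") as f:
                f.write(response.content)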

For option (b), you probably do want a multithread/multiprocess route. I'd suggest only 2-3 simultaneous threads, since any more will likely just result in sharing bandwidth among the connections, and raise the risk of getting banned for multiple downloads.
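
For option (b), a small thread pool is enough. The sketch below assumes the standard-library concurrent.futures module (Python 3.2+) and the same placeholder URLs; it caps the pool at the 3 simultaneous connections suggested above:

    from concurrent.futures import ThreadPoolExecutor
    import os
    import requests

    # Placeholder URLs for illustration.
    urls = [
        "http://example.com/manga/page1.jpg",
        "http://example.com/manga/page2.jpg",
        "http://example.com/manga/page3.jpg",
    ]

    def download(url):
        # Each worker thread downloads one file; the thread spends most of
        # its time blocked on the network, so the GIL isn't a bottleneck.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        filename = os.path.basename(url)
        with open(filename, "wb") as f:
            f.write(response.content)
        return filename

    # At most 3 simultaneous connections, per the suggestion above.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for name in pool.map(download, urls):
            print("finished", name)

Swapping ThreadPoolExecutor for ProcessPoolExecutor would also work, but processes buy you nothing here since the work is IO-bound rather than CPU-bound.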

The GIL doesn't really matter for this use case, since your code will be doing almost no processing, spending most of its time waiting for bytes to arrive over the network.

The lazy way to do this is to avoid Python entirely, because most UNIX-like environments have good building blocks for this. (If you're on Windows, your best choices for this approach would be msys, cygwin, or a VirtualBox running some flavor of Linux; I personally like Linux Mint.) If you have a list of the URLs you want to download, one per line, in a text file, try this:

cat myfile.txt | xargs -n 1 --max-procs 3 --verbose wget

The "xargs" command with these parameters will take a whitespace-delimited URL's on stdin (in this case coming from myfile.txt) and run "wget" on each of them. It will allow up to 3 "wget" subprocesses to run at a time, when one of them completes (or errors out), it will read another line and launch another subprocess, until all the input URL's are exhausted. If you need cookies or other complicated stuff, curl might be a better choice than wget.

answered Oct 29 '22 by picomancer


It doesn't really matter. It is indeed true that threads waiting on IO won't get in the way of other threads running, and since downloading over the Internet is an IO-bound task, there's no real reason to try to spread your execution threads over multiple CPUs. Given that and the fact that threads are more light-weight than processes, it might be better to use threads, but you honestly aren't going to notice the difference.
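
To make that concrete, here's a bare-bones sketch with plain threads and only the standard library; the URLs are placeholders. While one thread is blocked waiting on the network it releases the GIL, so the others keep making progress:

    import threading
    import urllib.request

    # Placeholder URLs for illustration.
    urls = ["http://example.com/manga/page%d.jpg" % n for n in range(1, 4)]

    def fetch(url):
        # The thread releases the GIL while blocked on the socket read,
        # so the other threads can run during the wait.
        with urllib.request.urlopen(url, timeout=30) as response:
            data = response.read()
        print(url, len(data), "bytes")

    threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()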

How many threads you should use depends on how hard you want to hit the website. Be courteous and take care that your scraping isn't viewed as a DOS attack.

answered Oct 29 '22 by Cairnarvon