I have a list of web addresses (>10k) that I need to download. Downloading them with a single-threaded application is very time-consuming. Which is the better option, multithreading or multiple BackgroundWorker instances, and why?
The approach you should use depends in large part on just how quickly you want to download those 10,000 pages and how often you want to do it.
In general, you can expect a single-threaded download application to average about one page per second. Your results will vary depending on the sites you're downloading from. Getting stuff from yahoo.com is going to be faster than downloading from a server that somebody's hosting on a cable modem. The nice thing about a single-threaded download application is that it's very easy to write. If you only need to download those pages once, write the single-threaded app, put it to work, and take a long lunch. You'll have your data in about three hours.
If you have a quad-core machine, you can do about four pages per second. Just write your single-threaded application, split your URLs list into four equal pieces, start four instances of your application, and take a regular lunch. You'll have the data when you get back.
If you'll be downloading those pages on a regular basis, then you can write your program to maintain a BlockingCollection for your URLs. Spin up four threads, each of which does essentially this:
while (queue not empty)
{
dequeue url
download page
}
That will execute in about the same amount of time as having four separate instances of the single-threaded downloader. Actually, it will probably execute slightly faster, because you're not splitting the queue: you don't have the problem of one thread finishing its share and stopping while other threads still have URLs left to download. Again, the program is incredibly easy to write, and you'll have those 10,000 pages in under an hour.
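The shared-queue loop above can be sketched in C# roughly like this. It's a minimal sketch, assuming the URL list is already in memory; `SavePage` is a placeholder name for whatever storage code you use:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading;

class SharedQueueDownloader
{
    static void Main()
    {
        var urls = new BlockingCollection<string>();
        // ... add your 10,000 URLs here ...
        urls.CompleteAdding(); // signal that no more URLs will be added

        var threads = new List<Thread>();
        for (int i = 0; i < 4; i++)
        {
            var t = new Thread(() =>
            {
                using (var client = new WebClient())
                {
                    // GetConsumingEnumerable ends once the collection is
                    // marked complete and drained, so each thread exits
                    // cleanly when the shared queue runs dry.
                    foreach (var url in urls.GetConsumingEnumerable())
                    {
                        byte[] data = client.DownloadData(url);
                        SavePage(url, data);
                    }
                }
            });
            t.Start();
            threads.Add(t);
        }
        threads.ForEach(t => t.Join());
    }

    static void SavePage(string url, byte[] data)
    {
        // placeholder: write the page wherever you keep your data
    }
}
```

Because all four threads consume from the same `BlockingCollection`, the work balances itself: a thread that happens to get fast sites just pulls more URLs.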
You can go much faster than that. At typical cable modem speeds, you can achieve close to 20 pages per second without too much trouble. Forget using the TPL or ThreadPool.QueueUserWorkItem, etc. Instead, use WebClient and DownloadDataAsync. Create a queue of, say, 10 WebClient instances. Then, your main thread does this:
while (url queue is not empty)
{
client = dequeue WebClient // this will block if all clients are currently busy
url = dequeue url
client.DownloadDataAsync(url)
}
The WebClient instance's DownloadDataCompleted event handler will be called when the download is completed, so you can save the data. It also puts the WebClient instance back into the queue so that it can be re-used.
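Here's a sketch of that client-pool approach in C#, under the same assumptions as before (the URL-loading and page-saving code is yours to fill in, and the shutdown step at the end is only outlined):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;

class ClientPoolDownloader
{
    // Clients that are free to start a download; Take() blocks
    // the main thread whenever all of them are busy.
    static readonly BlockingCollection<WebClient> FreeClients =
        new BlockingCollection<WebClient>();

    static void Main()
    {
        var urls = new Queue<string>();
        // ... load your URLs here ...

        for (int i = 0; i < 10; i++)
        {
            var client = new WebClient();
            client.DownloadDataCompleted += OnDownloadDataCompleted;
            FreeClients.Add(client);
        }

        while (urls.Count > 0)
        {
            WebClient client = FreeClients.Take(); // blocks if all 10 busy
            string url = urls.Dequeue();
            client.DownloadDataAsync(new Uri(url), url); // pass url as UserState
        }
        // Before exiting, wait for the in-flight downloads to finish,
        // e.g. by Take()-ing all 10 clients back out of FreeClients.
    }

    static void OnDownloadDataCompleted(object sender,
                                        DownloadDataCompletedEventArgs e)
    {
        if (e.Error == null)
        {
            string url = (string)e.UserState;
            // ... save e.Result (the downloaded bytes) for this url ...
        }
        FreeClients.Add((WebClient)sender); // return the client for re-use
    }
}
```

The queue of clients doubles as the concurrency limit: there are never more than 10 outstanding requests, because the main thread can't dequeue an 11th client until one comes back.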
Again, this is a fairly simple approach, but it's very effective. It takes advantage of the asynchronous capabilities of HttpWebRequest (which is what WebClient uses to do its thing). With this approach you don't end up with 10 or more threads executing all the time. Instead, the thread pool spins up and uses only as many threads as required to read the data and execute your callback. If you use TPL or some other explicit multithreading technique, you end up with a bunch of threads that spend most of their time doing nothing while waiting for connections, etc.
You'll have to play with the number of concurrent downloads (i.e., the number of WebClient instances you have in your queue). How many you can support depends mostly on the speed of your Internet connection. It will also depend on the average latency of DNS requests, which can be surprisingly long, and on how many different domains you're downloading from.
One other caution when using a multithreaded approach is politeness. If all 10,000 of those URLs are from the same domain, you do not want to be hitting it with 10 simultaneous requests. The site will likely think you're trying to perpetrate a DoS attack, and block you. If those URLs are from just a handful of domains, you'll need to throttle your connections. If you only have a handful of URLs from any one particular domain, then this isn't a problem.
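One simple way to throttle is a per-domain semaphore that callers acquire before starting a request and release when it completes. This is only a sketch of the idea, not part of the answer's original code, and the limit of 2 concurrent requests per host is an arbitrary assumption; pick whatever the target sites will tolerate:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class DomainThrottle
{
    // One gate per host, created lazily the first time that host is seen.
    static readonly ConcurrentDictionary<string, SemaphoreSlim> Gates =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    const int MaxPerHost = 2; // assumption: tune per site

    public static void Acquire(Uri url)
    {
        var gate = Gates.GetOrAdd(url.Host,
            _ => new SemaphoreSlim(MaxPerHost));
        gate.Wait(); // blocks until this host has a free slot
    }

    public static void Release(Uri url)
    {
        Gates[url.Host].Release();
    }
}
```

You'd call `DomainThrottle.Acquire(uri)` just before starting a download and `DomainThrottle.Release(uri)` in the completed handler, so no host ever sees more than the configured number of simultaneous requests from you.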