I've been thinking about making my web scraper multithreaded, not with normal threads (e.g. Thread scrape = new Thread(Function);) but with something like a threadpool, where there can be a very large number of threads.
My scraper works by using a for loop to scrape pages:
for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)
So how could I multithread the function (that contains the loop) with something like a threadpool? I've never used threadpools before and the examples I've seen have been quite confusing or obscure to me.
I've modified my loop into this:
int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
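// Parallel.For excludes its upper bound, so max + 1 keeps the last page (matching i <= max above)
Parallel.For(min, max + 1, pOptions, i =>
{
    // Scraping
});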
Would that work or have I got something wrong?
The problem with using pool threads is that they spend most of their time waiting for a response from the Web site. And the problem with using Parallel.ForEach is that it limits your parallelism.
I got the best performance by using asynchronous Web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.
The main thread creates the Semaphore, like this:
Semaphore _requestsSemaphore = new Semaphore(20, 20);
The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution and, on average, it takes about 50 ms. At least, it did in my environment. 20 concurrent requests was the absolute maximum. 15 is probably more reasonable.
The main thread essentially loops, like this:
while (true)
{
    _requestsSemaphore.WaitOne();
    string urlToCrawl = DequeueUrl(); // however you do that
    var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // set request properties as appropriate
    // and then do an asynchronous request
    request.BeginGetResponse(ResponseCallback, request);
}
The ResponseCallback method, which will be called on a pool thread, does the processing, disposes of the response, and then releases the semaphore so that another request can be made.
void ResponseCallback(IAsyncResult ir)
{
    try
    {
        var request = (HttpWebRequest)ir.AsyncState;
        // you'll want exception handling here
        using (var response = (HttpWebResponse)request.EndGetResponse(ir))
        {
            // process the response here.
        }
    }
    finally
    {
        // release the semaphore so that another request can be made
        _requestsSemaphore.Release();
    }
}
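To try the throttling pattern above without hitting a real site, here is a minimal, self-contained sketch. It is an illustration under assumptions, not the crawler itself: SemaphoreThrottleDemo, Run and the work delegate are hypothetical names, and ThreadPool.QueueUserWorkItem stands in for the BeginGetResponse callback. It returns the highest number of work items that were actually running at once, so you can check that the semaphore holds the limit.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class SemaphoreThrottleDemo
{
    // Processes every item with at most maxConcurrent work items running at once
    // and returns the highest concurrency that was actually observed.
    // Exceptions thrown by work are not handled in this sketch.
    public static int Run(IEnumerable<string> items, int maxConcurrent, Action<string> work)
    {
        int running = 0, peak = 0;
        var sync = new object();

        using (var gate = new Semaphore(maxConcurrent, maxConcurrent))
        using (var pending = new CountdownEvent(1)) // the 1 represents the dispatching thread
        {
            foreach (string item in items)
            {
                gate.WaitOne();          // blocks while maxConcurrent items are in flight
                pending.AddCount();      // one more callback to wait for
                string current = item;   // copy so the closure is safe on pre-C# 5 compilers too
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try
                    {
                        lock (sync) { running++; if (running > peak) peak = running; }
                        work(current);
                    }
                    finally
                    {
                        lock (sync) { running--; }
                        gate.Release();  // free the slot, as ResponseCallback does
                        pending.Signal();
                    }
                });
            }

            pending.Signal();            // the dispatcher has finished queuing
            pending.Wait();              // wait for the outstanding callbacks
        }

        return peak;
    }
}
```

For example, `int peak = SemaphoreThrottleDemo.Run(urls, 15, url => Thread.Sleep(50));` simulates 50 ms of waiting per URL with a limit of 15.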
The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See "Is this really asynchronous?" for more information.
This is simple to implement and works quite well. It's possible to get even more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.
You can probably simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
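One possible shape for that, as a rough sketch rather than a tuned crawler: SemaphoreSlim.WaitAsync plays the role of the Semaphore, and Task.WhenAll collects the results. ThrottledScraper, FetchAllAsync and the fetch delegate are names I made up for illustration, and using HttpClient instead of HttpWebRequest is an assumption on my part. The same DNS limits described above would presumably still apply.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledScraper
{
    // One shared HttpClient for all requests.
    private static readonly HttpClient Http = new HttpClient();

    // Runs fetch for every URL with at most maxConcurrent fetches in flight.
    // Results come back in the same order as the input URLs.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls,
        int maxConcurrent,
        Func<string, Task<string>> fetch)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent, maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();   // wait for a free slot without blocking a thread
                try
                {
                    return await fetch(url);
                }
                finally
                {
                    gate.Release();       // free the slot so another request can start
                }
            }).ToList();

            // Completes once every fetch has finished (or rethrows a failure).
            return await Task.WhenAll(tasks);
        }
    }

    // Convenience overload that downloads each page body over HTTP.
    public static Task<string[]> FetchAllAsync(IEnumerable<string> urls, int maxConcurrent)
    {
        return FetchAllAsync(urls, maxConcurrent, url => Http.GetStringAsync(url));
    }
}
```

From an async method you would call it as `string[] pages = await ThrottledScraper.FetchAllAsync(urls, 15);` and do the scraping on each returned page body. Passing your own fetch delegate lets you do the scraping inside the throttled section instead, or swap in a fake for testing.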