Multithreading a web scraper?

I've been thinking about making my web scraper multithreaded, not with ordinary threads (e.g. Thread scrape = new Thread(Function);) but with something like a thread pool that can handle a very large number of queued jobs.

My scraper works by using a for loop to scrape pages.

for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)

So how could I multithread the function (that contains the loop) with something like a thread pool? I've never used thread pools before, and the examples I've seen have been quite confusing or obscure to me.


I've modified my loop into this:

int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
// Parallel.For's upper bound is exclusive, so use max + 1
// to match the original inclusive loop (i <= max)
Parallel.For(min, max + 1, pOptions, i =>
{
    //Scraping
});

Would that work or have I got something wrong?

AlphaDelta asked Dec 27 '22

1 Answer

The problem with using pool threads is that they spend most of their time blocked, waiting for a response from the Web site. And the problem with using Parallel.ForEach is that it limits your parallelism to the number of threads it runs.

I got the best performance by using asynchronous Web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.

The main thread creates the Semaphore, like this:

Semaphore _requestsSemaphore = new Semaphore(20, 20);

The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution, which on average takes about 50 ms. At least, it did in my environment. 20 concurrent requests was the absolute maximum; 15 is probably more reasonable.

The main thread essentially loops, like this:

while (true)
{
    _requestsSemaphore.WaitOne();
    string urlToCrawl = DequeueUrl();  // however you do that
    var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // set request properties as appropriate
    // and then do an asynchronous request
    request.BeginGetResponse(ResponseCallback, request);
}

The ResponseCallback method, which will be called on a pool thread, does the processing, disposes of the response, and then releases the semaphore so that another request can be made.

void ResponseCallback(IAsyncResult ir)
{
    try
    {
        var request = (HttpWebRequest)ir.AsyncState;
        // you'll want exception handling here
        using (var response = (HttpWebResponse)request.EndGetResponse(ir))
        {
            // process the response here.
        }
    }
    finally
    {
        // release the semaphore so that another request can be made
        _requestsSemaphore.Release();
    }
}

The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See Is this really asynchronous? for more information.

This is simple to implement and works quite well. It's possible to get even more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.

You can probably simplify the above by using Task and the new async/await support in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
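For what it's worth, here is a minimal, self-contained sketch of the same throttled-crawl idea with Task and async/await (.NET 4.5+). SemaphoreSlim plays the role of the Semaphore above, but can be awaited without blocking a thread. FakeFetchAsync is a stand-in for the real HTTP request; all names here are illustrative, not the answer's original code.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledCrawler
{
    public const int MaxConcurrent = 20;

    // Same role as the Semaphore above: at most 20 requests in flight.
    static readonly SemaphoreSlim _requestsSemaphore =
        new SemaphoreSlim(MaxConcurrent, MaxConcurrent);

    static int _inFlight;   // "requests" currently running
    static int _peak;       // highest concurrency observed

    public static int PeakInFlight { get { return _peak; } }

    // Stand-in for the real HTTP call; with HttpClient you would
    // 'return await client.GetStringAsync(url);' here instead.
    static async Task<string> FakeFetchAsync(string url)
    {
        int now = Interlocked.Increment(ref _inFlight);
        InterlockedMax(ref _peak, now);
        await Task.Delay(10);   // simulated network latency
        Interlocked.Decrement(ref _inFlight);
        return "<html>" + url + "</html>";
    }

    // Lock-free "record the maximum value seen" helper.
    static void InterlockedMax(ref int target, int value)
    {
        int current;
        while (value > (current = Volatile.Read(ref target)))
            Interlocked.CompareExchange(ref target, value, current);
    }

    public static async Task<string> CrawlAsync(string url)
    {
        await _requestsSemaphore.WaitAsync();   // throttle, like WaitOne above
        try { return await FakeFetchAsync(url); }
        finally { _requestsSemaphore.Release(); }
    }

    public static void Main()
    {
        // Start all 100 crawls at once; the semaphore caps concurrency at 20.
        var crawls = Enumerable.Range(1, 100)
                               .Select(i => CrawlAsync("page/" + i));
        string[] pages = Task.WhenAll(crawls).Result;
        Console.WriteLine(pages.Length + " pages, peak concurrency " + _peak);
    }
}
```

Note that there's no callback plumbing at all: the awaits resume on pool threads, and the semaphore replaces both the blocking WaitOne and the Release in the callback's finally block.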

Jim Mischel answered Jan 12 '23