
Parallel scraping in .NET

The company I work for runs a few hundred very dynamic web sites. It has decided to build a search engine, and I was tasked with writing the scraper. Some of the sites run on old hardware and cannot take much load, while others can handle a massive number of simultaneous users.

I need to be able to say: use 5 parallel requests for site A, 2 for site B, and 1 for site C.

I know I can use threads, mutexes, semaphores, etc. to accomplish this, but it will be quite complicated. Are any of the higher-level frameworks, like TPL, async/await, or TPL Dataflow, powerful enough to build this app in a simpler manner?

Bob, asked Mar 06 '14 17:03


2 Answers

I recommend you use HttpClient with Task.WhenAll, with SemaphoreSlim for simple throttling:

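using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Allows at most 5 downloads in flight at once.
private SemaphoreSlim _mutex = new SemaphoreSlim(5);
// A single shared HttpClient reuses connections across requests.
private HttpClient _client = new HttpClient();
private async Task<string> DownloadStringAsync(string url)
{
  // Asynchronously wait for a free slot.
  await _mutex.WaitAsync();
  try
  {
    return await _client.GetStringAsync(url);
  }
  finally
  {
    // Always free the slot, even if the request throws.
    _mutex.Release();
  }
}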

IEnumerable<string> urls = ...;
var data = await Task.WhenAll(urls.Select(url => DownloadStringAsync(url)));
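
The single SemaphoreSlim above applies one limit to every URL, while the question asks for a different limit per site. One possible extension, sketched here under assumptions, is to keep one semaphore per host. The `PerHostThrottle` class, its `RunAsync` method, the `Scraper` helper, and the default limit of 1 are all hypothetical names and choices for illustration; none of them come from the answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of the original answer): one SemaphoreSlim per host.
public sealed class PerHostThrottle
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates =
        new ConcurrentDictionary<string, SemaphoreSlim>(StringComparer.OrdinalIgnoreCase);
    private readonly IReadOnlyDictionary<string, int> _limits;
    private readonly int _defaultLimit;

    // limits maps a host name to its maximum number of parallel requests.
    // Hosts that are not listed fall back to defaultLimit (an assumed policy).
    public PerHostThrottle(IReadOnlyDictionary<string, int> limits, int defaultLimit = 1)
    {
        _limits = limits;
        _defaultLimit = defaultLimit;
    }

    // Runs the action while holding one of the slots that belong to url.Host.
    public async Task<T> RunAsync<T>(Uri url, Func<Task<T>> action)
    {
        var gate = _gates.GetOrAdd(url.Host, host =>
            new SemaphoreSlim(_limits.TryGetValue(host, out var n) ? n : _defaultLimit));

        await gate.WaitAsync();
        try
        {
            return await action();
        }
        finally
        {
            // Free the slot even if the action throws.
            gate.Release();
        }
    }
}

public static class Scraper
{
    private static readonly HttpClient Client = new HttpClient();

    // Downloads every URL, throttled per host; results keep the order of urls.
    public static Task<string[]> DownloadAllAsync(IEnumerable<string> urls, PerHostThrottle throttle)
    {
        return Task.WhenAll(urls.Select(url =>
            throttle.RunAsync(new Uri(url), () => Client.GetStringAsync(url))));
    }
}
```

A caller would build the throttle once, for example with `new Dictionary<string, int> { ["site-a.example"] = 5, ["site-b.example"] = 2, ["site-c.example"] = 1 }` (placeholder host names), and pass it to `DownloadAllAsync`. All tasks still start at once, but each one waits on its own host's semaphore, so a slow site does not hold up requests to the others.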

Alternatively, you could use TPL Dataflow and set MaxDegreeOfParallelism for the throttling.

Stephen Cleary, answered Sep 22 '22 10:09


TPL Dataflow and async-await are indeed powerful and simple enough to do just what you need:

async Task<IEnumerable<string>> GetAllStringsAsync(IEnumerable<string> urls)
{
    var client = new HttpClient();
    // Thread-safe collection for the results; completion order, not input order.
    var bag = new ConcurrentBag<string>();
    // The block processes at most 5 URLs concurrently.
    var block = new ActionBlock<string>(
        async url => bag.Add(await client.GetStringAsync(url)),
        new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 5});
    foreach (var url in urls)
    {
        block.Post(url);
    }
    // Signal that no more URLs are coming, then wait for the queue to drain.
    block.Complete();
    await block.Completion;
    return bag;
}
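
To get the per-site limits from the question with this approach, one option is to create a separate ActionBlock per host, each with its own MaxDegreeOfParallelism. The following is a rough sketch under assumptions, not the answerer's code: `PerSiteDataflow`, `GetAllAsync`, the `limits` dictionary, the injected `fetch` delegate, and the default limit of 1 are all hypothetical. On older frameworks, TPL Dataflow comes from the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical helper: one ActionBlock per host, each with its own parallelism limit.
public static class PerSiteDataflow
{
    // limits: host name -> max parallel requests (unlisted hosts use defaultLimit).
    // fetch: the download function, e.g. u => client.GetStringAsync(u).
    public static async Task<IEnumerable<string>> GetAllAsync(
        IEnumerable<Uri> urls,
        IReadOnlyDictionary<string, int> limits,
        Func<Uri, Task<string>> fetch,
        int defaultLimit = 1)
    {
        var bag = new ConcurrentBag<string>();
        var blocks = new Dictionary<string, ActionBlock<Uri>>(StringComparer.OrdinalIgnoreCase);

        foreach (var url in urls)
        {
            // Lazily create the block for this host the first time it is seen.
            if (!blocks.TryGetValue(url.Host, out var block))
            {
                var parallelism = limits.TryGetValue(url.Host, out var n) ? n : defaultLimit;
                block = new ActionBlock<Uri>(
                    async u => bag.Add(await fetch(u)),
                    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = parallelism });
                blocks.Add(url.Host, block);
            }
            block.Post(url);
        }

        // Signal completion on every block, then wait for all of them to drain.
        foreach (var b in blocks.Values)
        {
            b.Complete();
        }
        await Task.WhenAll(blocks.Values.Select(b => b.Completion));
        return bag;
    }
}
```

As in the single-block version, results arrive in completion order, and an exception thrown by `fetch` faults that host's block and surfaces when its `Completion` task is awaited.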

i3arnon, answered Sep 25 '22 10:09