I have a list of about 1,000 URLs, and I was wondering whether there is a more efficient way to call multiple URLs from the same site (I am already raising ServicePointManager.DefaultConnectionLimit).
Also, is it better to reuse the same HttpClient or to create a new one on every call? The code below uses a single shared instance:
using (var client = new HttpClient { Timeout = TimeSpan.FromMinutes(5) })
{
    var tasks = urls.Select(async url =>
    {
        // await directly instead of mixing await with ContinueWith/.Result
        var resultHtml = await client.GetStringAsync(url);
        // process the html
    }).ToList();
    Task.WaitAll(tasks.ToArray());
}
As suggested by @cory, here is the modified code using TPL Dataflow. However, I have to set MaxDegreeOfParallelism = 100 to achieve approximately the same speed as the task-based version. Can the code below be improved?
var downloader = new ActionBlock<string>(async url =>
{
    // dispose the WebClient once the download is done
    using (var client = new WebClient())
    {
        var resultHtml = await client.DownloadStringTaskAsync(new Uri(url));
        // process the html
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });
foreach (var url in urls)
{
    downloader.Post(url);
}
downloader.Complete();
downloader.Completion.Wait();
FINAL
public void DownloadUrlContents(List<string> urls)
{
    var watch = Stopwatch.StartNew();
    var httpClient = new HttpClient();
    var downloader = new ActionBlock<string>(async url =>
    {
        var data = await httpClient.GetStringAsync(url);
        // process the data here
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });
    foreach (var url in urls)
    {
        // Post is sufficient here because the block's input queue is unbounded;
        // wrapping SendAsync in Parallel.ForEach would discard the returned task
        downloader.Post(url);
    }
    downloader.Complete();
    downloader.Completion.Wait();
    Console.WriteLine($"{MethodBase.GetCurrentMethod().Name} {watch.Elapsed}");
}
Though your code will work, it's common practice to introduce a buffer block in front of your ActionBlock. Why do this? First, it gives you control over queue size: you can easily cap the number of messages waiting in the queue. Second, adding a message to the buffer is almost instantaneous, and after that it is TPL Dataflow's responsibility to handle all your items:
// async method here
public async Task DownloadUrlContents(List<string> urls)
{
    var watch = Stopwatch.StartNew();
    var httpClient = new HttpClient();
    // you may limit the buffer size here
    var buffer = new BufferBlock<string>();
    var downloader = new ActionBlock<string>(async url =>
    {
        var data = await httpClient.GetStringAsync(url);
        // handle data here
    },
    // note the processor count usage here
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });
    // notify TPL Dataflow to forward messages from the buffer to the downloader
    buffer.LinkTo(downloader, new DataflowLinkOptions { PropagateCompletion = true });
    foreach (var url in urls)
    {
        // await here so a bounded buffer can apply backpressure
        await buffer.SendAsync(url);
    }
    // the queue is done
    buffer.Complete();
    // now it's safe to wait for completion of the downloader
    await downloader.Completion;
    Console.WriteLine($"{MethodBase.GetCurrentMethod().Name} {watch.Elapsed}");
}
Essentially, reusing the HttpClient is better, because you don't have to authenticate every single time you send a request, and you can preserve session state via cookies, whereas a freshly created instance has to be initialized with a token/cookies each time. Other than that, it all comes down to ServicePoint, where you can set the maximum allowed number of concurrent connections.
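As a minimal sketch of that ServicePoint tuning (the host name is just a placeholder): on the classic .NET Framework stack the limit can be set globally or per endpoint. Note that on .NET Core / .NET 5+, HttpClient's default handler ignores ServicePointManager, and HttpClientHandler.MaxConnectionsPerServer is the equivalent knob there.

```csharp
using System;
using System.Net;
using System.Net.Http;

class ConnectionLimitSketch
{
    static void Main()
    {
        // Global default applied to all ServicePoints created afterwards (.NET Framework)
        ServicePointManager.DefaultConnectionLimit = 100;

        // Per-endpoint override: all requests to this host share one ServicePoint
        var servicePoint = ServicePointManager.FindServicePoint(new Uri("http://example.com"));
        servicePoint.ConnectionLimit = 50;

        // On .NET Core / .NET 5+, configure the handler instead:
        var handler = new HttpClientHandler { MaxConnectionsPerServer = 50 };
        var httpClient = new HttpClient(handler);
    }
}
```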
To make the parallel calls more maintainable, I would suggest using the AsyncEnumerator NuGet package, which allows you to write code like this:
using System.Collections.Async;

await uris.ParallelForEachAsync(
    async uri =>
    {
        var html = await httpClient.GetStringAsync(uri, cancellationToken);
        // process HTML
    },
    maxDegreeOfParallelism: 5,
    breakLoopOnException: false,
    cancellationToken: cancellationToken);