I'm using Parallel LINQ, and I'm trying to download many URLs concurrently using essentially code like this:
int threads = 10;
Dictionary<string, string> results = urls.AsParallel(threads).ToDictionary(url => url, url => GetPage(url));
Since downloading web pages is network bound rather than CPU bound, using more threads than my number of processors/cores is very beneficial, since most of the time in each thread is spent waiting for the network to catch up. However, judging from the fact that running the above with threads = 2 has the same performance as threads = 10 on my dual-core machine, I'm thinking that the number of threads passed to AsParallel is limited to the number of cores.
Is there any way to override this behavior? Is there a similar library available that doesn't have this limitation?
(I've found such a library for Python, but I need something that works in .NET.)
By default, .NET has a limit of 2 concurrent connections to a single service point (IP:port). That's why you would not see a difference if all the URLs point to the same server.
It can be controlled using the ServicePointManager.DefaultConnectionLimit property.
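As a minimal sketch of raising that limit: the settable property is ServicePointManager.DefaultConnectionLimit (DefaultPersistentConnectionLimit is a constant that holds the default value). The value 10 and the class name below are only illustrative. The assumption is that the downloads go through HttpWebRequest or WebClient on the .NET Framework, and that the limit is set before the first request to the server is made.

```csharp
using System;
using System.Net;

public static class ConnectionLimitDemo
{
    public static void Main()
    {
        // Show the limit currently in effect for new service points.
        Console.WriteLine("Before: {0}", ServicePointManager.DefaultConnectionLimit);

        // Allow up to 10 concurrent connections per service point.
        // 10 is an arbitrary example value, chosen to match "threads = 10".
        ServicePointManager.DefaultConnectionLimit = 10;

        Console.WriteLine("After: {0}", ServicePointManager.DefaultConnectionLimit);
    }
}
```

Service points that already exist keep the limit they were created with, so this is best done once at start-up.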
Do the URLs refer to the same server? If so, it could be that you are hitting the HTTP connection limit instead of the threading limit. There's an easy way to tell - change your code to:
int threads = 10;
Dictionary<string, string> results = urls.AsParallel(threads)
    .ToDictionary(url => url,
                  url => {
                      Console.WriteLine("On thread {0}",
                                        Thread.CurrentThread.ManagedThreadId);
                      return GetPage(url);
                  });
EDIT: Hmm. I can't get ToDictionary() to parallelise at all with a bit of sample code. It works fine for Select(url => GetPage(url)) but not ToDictionary. Will search around a bit.
EDIT: Okay, I still can't get ToDictionary to parallelise, but you can work around that. Here's a short but complete program:
using System;
using System.Collections.Generic;
using System.Threading;
using System.Linq;
using System.Linq.Parallel;

public class Test
{
    static void Main()
    {
        var urls = Enumerable.Range(0, 100).Select(i => i.ToString());
        int threads = 10;
        // Do the slow work in Select, which parallelises; ToDictionary then just collects the results.
        Dictionary<string, string> results = urls.AsParallel(threads)
            .Select(url => new { Url=url, Page=GetPage(url) })
            .ToDictionary(x => x.Url, x => x.Page);
    }

    static string GetPage(string x)
    {
        Console.WriteLine("On thread {0} getting {1}",
                          Thread.CurrentThread.ManagedThreadId, x);
        // Simulate a slow download.
        Thread.Sleep(2000);
        return x;
    }
}
So, how many threads does this use? 5. Why? Goodness knows. I've got 2 processors, so that's not it - and we've specified 10 threads, so that's not it. It still uses 5 even if I change GetPage to hammer the CPU.
If you only need to use this for one particular task - and you don't mind slightly smelly code - you might be best off implementing it yourself, to be honest.
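For readers on the PLINQ that shipped with .NET 4: that release has no AsParallel overload taking a thread count, and the degree is requested with WithDegreeOfParallelism instead. The sketch below applies the same Select-then-ToDictionary workaround under that API. FetchAll, the class name, and the sleeping GetPage stand-in are illustrative names, not part of any library, and the requested degree is an upper bound rather than a guarantee of how many threads actually run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class PlinqDegreeDemo
{
    // Hypothetical helper: runs getPage over urls with at most `threads` concurrent workers.
    public static Dictionary<string, string> FetchAll(
        IEnumerable<string> urls, int threads, Func<string, string> getPage)
    {
        return urls.AsParallel()
                   .WithDegreeOfParallelism(threads)   // upper bound on concurrency
                   .Select(url => new { Url = url, Page = getPage(url) })
                   .ToDictionary(x => x.Url, x => x.Page);
    }

    // Stand-in for a real download: sleeps briefly and echoes its input.
    static string GetPage(string x)
    {
        Console.WriteLine("On thread {0} getting {1}",
                          Thread.CurrentThread.ManagedThreadId, x);
        Thread.Sleep(200);
        return x;
    }

    public static void Main()
    {
        var urls = Enumerable.Range(0, 20).Select(i => i.ToString());
        Dictionary<string, string> results = FetchAll(urls, 10, GetPage);
        Console.WriteLine(results.Count); // prints 20
    }
}
```

Counting the distinct thread ids in the output gives a quick check of how many workers were really used on a given machine.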
I think there are already good answers to the question, but I'd like to make one important point. Using PLINQ for tasks that are not CPU bound is, in principle, the wrong design. That's not to say that it won't work - it will - but using multiple threads when they are unnecessary can cause trouble.
Unfortunately, there is no good way to solve this problem in C#. In F# you could use asynchronous workflows that run in parallel but don't block the thread when performing asynchronous calls (under the covers, they use the BeginOperation and EndOperation methods). You can find more information here:
The same idea can to some extent be used in C#, but it looks a bit weird (though it is more efficient). I wrote an article about that, and there is also a library that should be slightly more evolved than my original idea:
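Separately from the article and library mentioned above, later versions of C# (5 and up) added async/await, which expresses the same non-blocking idea directly. The following is only a rough sketch under that assumption. FetchAllAsync, the class name, and FakeGetPageAsync are hypothetical names, and the fake download just delays instead of touching the network. A real version would pass in a delegate that performs an asynchronous HTTP request.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncDownloadSketch
{
    // Hypothetical helper: starts every download, but lets at most
    // maxConcurrent of them be in flight at any one time.
    public static async Task<Dictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls, int maxConcurrent,
        Func<string, Task<string>> getPageAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot without blocking a thread
                try
                {
                    string page = await getPageAsync(url);
                    return new KeyValuePair<string, string>(url, page);
                }
                finally
                {
                    gate.Release();              // free the slot for the next download
                }
            }).ToList();

            KeyValuePair<string, string>[] pairs = await Task.WhenAll(tasks);
            return pairs.ToDictionary(p => p.Key, p => p.Value);
        }
    }

    // Stand-in for a real asynchronous download.
    static async Task<string> FakeGetPageAsync(string url)
    {
        await Task.Delay(200);
        return "page for " + url;
    }

    public static void Main()
    {
        var urls = Enumerable.Range(0, 20).Select(i => i.ToString());
        var results = FetchAllAsync(urls, 10, FakeGetPageAsync).GetAwaiter().GetResult();
        Console.WriteLine(results.Count); // prints 20
    }
}
```

While a download is pending, no thread is parked waiting on it, which is the property the F# workflows above provide. The semaphore caps how many requests are outstanding at once, independently of the number of cores.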