 

Task Parallel is unstable, using 100% CPU at times

I'm currently testing out Parallel in C#. Generally it works fine, and Parallel is faster than a normal foreach loop. However, at times (roughly 1 run out of 5), my CPU reaches 100% usage and the parallel tasks become very slow. My setup is an i5-4570 with 8 GB RAM. Does anyone have any idea why this happens?

Below is the code I used to test the behaviour:

            // Using normal foreach
            ConcurrentBag<int> resultData = new ConcurrentBag<int>();
            Stopwatch sw = new Stopwatch();
            sw.Start();
            foreach (var item in testData)
            {
                if (item.Equals(1))
                {
                    resultData.Add(item);
                }
            }
            Console.WriteLine("Normal ForEach " + sw.ElapsedMilliseconds);

            // Using list parallel for
            resultData = new ConcurrentBag<int>();
            sw.Restart();
            System.Threading.Tasks.Parallel.For(0, testData.Count() - 1, (i, loopState) =>
            {
                int data = testData[i];
                if (data.Equals(1))
                {
                    resultData.Add(data);
                }
            });
            Console.WriteLine("List Parallel For " + sw.ElapsedMilliseconds);

            // Using list parallel foreach
            //resultData.Clear();
            resultData = new ConcurrentBag<int>();
            sw.Restart();
            System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
            {
                if (item.Equals(1))
                {
                    resultData.Add(item);
                }
            });
            Console.WriteLine("List Parallel ForEach " + sw.ElapsedMilliseconds);

            // Using concurrent parallel for 
            ConcurrentStack<int> resultData2 = new ConcurrentStack<int>();
            sw.Restart();
            System.Threading.Tasks.Parallel.For(0, testData.Count() - 1, (i, loopState) =>
            {
                int data = testData[i];
                if (data.Equals(1))
                {
                    resultData2.Push(data);
                }
            });
            Console.WriteLine("Concurrent Parallel For " + sw.ElapsedMilliseconds);

            // Using concurrent parallel foreach
            resultData2.Clear();
            sw.Restart();
            System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
            {
                if (item.Equals(1))
                {
                    resultData2.Push(item);
                }
            });
            Console.WriteLine("Concurrent Parallel ForEach " + sw.ElapsedMilliseconds);

Normal output

Normal ForEach 493

List Parallel For 315

List Parallel ForEach 328

Concurrent Parallel For 286

Concurrent Parallel ForEach 292

During 100% CPU usage

Normal ForEach 476

List Parallel For 8047

List Parallel ForEach 276

Concurrent Parallel For 281

Concurrent Parallel ForEach 3960

(This can occur during any of the parallel tests; the above is only one instance.)

Update

Using the PLINQ approach provided by @willaien and running it 100 times, the problem no longer occurs. I still have no idea why it surfaced in the first place, though.

var resultData3 = testData.AsParallel().Where(x => x == 1).ToList();
Asked Aug 17 '15 by Andy Wei



2 Answers

First of all, be careful with Parallel - it doesn't shield you from thread-safety issues. In your original code, you used non-thread-safe code when filling the list of results. In general, you want to avoid sharing any state (although read-only access to the input list is fine in a case like this). If you really want to use Parallel.For or Parallel.ForEach for filtering and aggregation (really, AsParallel is what you want in those cases), you should use the overload with thread-local state - you'd do the final results aggregation in the localFinally delegate (note that it still runs on a different thread, so you need to ensure thread-safety; however, locking is fine in that case, since you're only doing it once per thread, rather than on every iteration).
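A minimal sketch of that thread-local-state overload, using the question's testData, might look like this (names like results and resultsLock are made up for illustration):

    // Sketch: each thread collects its matches into a private List<int> (localInit),
    // and localFinally merges that list into the shared result under a lock,
    // once per thread rather than once per iteration.
    var results = new List<int>();
    object resultsLock = new object();

    System.Threading.Tasks.Parallel.For<List<int>>(0, testData.Count(),
        () => new List<int>(),                      // localInit: per-thread list
        (i, loopState, localList) =>
        {
            if (testData[i] == 1)
            {
                localList.Add(testData[i]);
            }
            return localList;                       // carried to the next iteration on this thread
        },
        localList =>                                // localFinally: runs once per thread
        {
            lock (resultsLock)
            {
                results.AddRange(localList);
            }
        });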

Now, the obvious first thing to try in a problem like this is to use a profiler. So I've done that. The results are as follows:

  • There are hardly any memory allocations in either of those solutions. They are entirely dwarfed by the initial test data allocation, even for relatively small test data (I used 1M, 10M and 100M integers when testing).
  • The time is spent in Parallel.For or Parallel.ForEach themselves, not in your code (the simple if (data[i] == 1) results.Add(data[i])).

The first means we can say the GC probably isn't the culprit - indeed, it doesn't get any chance to run. The second is more curious - it means that in some cases, the overhead of Parallel is way out of line, but it's seemingly random: sometimes it works without a hitch, and sometimes it takes half a second. This would usually point to the GC, but we've ruled that out already.

I've tried using the overload without the loop state, but that didn't help. I've tried limiting MaxDegreeOfParallelism, but it only ever hurt things. Now, obviously, this code is absolutely dominated by cache access - there's hardly any CPU work and no I/O - which will always favour a single-threaded solution; but even using a MaxDegreeOfParallelism of 1 doesn't help - indeed, 2 seems to be the fastest on my system. More is useless - again, cache access dominates. It's still curious - I'm using a server CPU for the tests, which has plenty of cache for all of the data at once, and while we're not doing 100% sequential access (which pretty much gets rid of the latency entirely), it should be quite sequential enough. Regardless, we have the baseline of memory throughput in the single-threaded solution, and it's very close to the speed of the parallelised case when it works well (parallelised, I'm seeing 40% less runtime than single-threaded, on a four-core server CPU, for an embarrassingly parallel problem - again, obviously, memory access is the limit).

So, it's time to check the reference source for Parallel.For. In a case like this, it simply creates ranges based on the number of workers - one range for each. So it's not the ranges - there's no overhead from that. The core simply runs a task that iterates over the given range. There are a few interesting bits - for example, the task will get "suspended" if it takes too long. However, that doesn't seem to fit the data too well - why would something like this cause random delays unrelated to the data size? No matter how small the work item, and no matter how low MaxDegreeOfParallelism is, we get "random" slowdowns. It might be a problem, but I have no idea how to check it.

The most interesting thing is that expanding the test data does nothing to the anomaly - while it makes the "good" parallel runs much faster (even getting close to perfect efficiency in my tests, oddly enough), the "bad" ones are still just as bad. In fact, in a few of my test runs, they are absurdly bad (up to ten times slower than the "normal" loop).

So, let's have a look at the threads. I've artificially bumped the number of threads in the ThreadPool to ensure that expanding the thread pool isn't a bottleneck (it shouldn't be if everything worked well, but...). And here comes the first surprise - while the "good" runs simply use the 4-8 threads that make sense, the "bad" runs expand over all the available threads in the pool, even if there are a hundred of them. Oops?
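For reference, the thread-pool bump can be done with ThreadPool.SetMinThreads; a sketch (the 100 here is just an arbitrary test value, not a recommendation):

    // Raise the minimum worker-thread count so that thread-pool injection delays
    // can't skew the measurement. 100 is an arbitrary test value.
    int workerMin, ioMin;
    System.Threading.ThreadPool.GetMinThreads(out workerMin, out ioMin);
    System.Threading.ThreadPool.SetMinThreads(100, ioMin);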

Let's dive into the source code once again. Parallel internally uses Task.RunSynchronously to run the root partitioned work job, and Waits on the result. When I look at the parallel stacks, there are 97 threads executing the loop body, and only one that actually has RunSynchronously on its stack (as expected - that's the main thread). The others are plain thread pool threads. The task IDs also tell a story - there are thousands of individual tasks being created while doing the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, this still happens, so it's not some closure weirdness either.

Explicitly setting MaxDegreeOfParallelism offsets this somewhat - the number of threads used doesn't explode anymore - but the number of tasks still does. We've already seen that the ranges are just the number of parallel tasks running - so why keep creating more and more tasks? Using the debugger confirms this - with a MaxDOP of four, there are only five ranges (there's some alignment that causes the fifth range). Interestingly, one of the completed ranges (how did the first one finish so far ahead of the rest?) has an index higher than the range it iterates - this is because the "scheduler" assigns range-partitions in slices of up to 16.

The root task is self-replicating, so instead of explicitly starting e.g. four tasks to handle the data, it waits for the scheduler to replicate the task to handle more data. It's kind of hard to read - we're talking about complex multi-threaded lock-free code - but it seems that it always assigns work in slices much smaller than the partitioned ranges. In my testing, the maximal size of a slice was 16 - a far cry from the millions of items I'm iterating over. Sixteen iterations with a body like this take no time at all, which can lead to many issues with the algorithm (the biggest being the infrastructure taking more CPU work than the actual iterator body). In some cases, cache thrashing might hurt performance even further (perhaps when there's a lot of variation in the body runtimes), but most of the time, the access is sequential enough.

TL;DR

Don't use Parallel.For and Parallel.ForEach if your work-per-iteration is very short (on the order of milliseconds). AsParallel or just running the iteration single-threaded will most likely be much faster.

Slightly longer explanation:

It seems that Parallel.For and Parallel.ForEach are designed for scenarios where the individual items you're iterating over take a substantial amount of time to execute (i.e. lots of work per item, not tiny amounts of work over a lot of items). They seem to perform badly when the iterator body is too short. If you're not doing substantial work in the iterator body, use AsParallel instead of Parallel.*. The sweet spot seems to be somewhere under 150ms per slice (around 10ms per iteration). Otherwise, Parallel.* will spend tons of time in its own code and hardly any time doing your iteration (in my case, the usual number was somewhere around 5-10% of the time spent in the body - embarrassingly bad).

Sadly, I didn't find any warning about this on MSDN - there are even samples iterating over substantial amounts of data, but there's no hint of the terrible performance hit of doing so. Testing the very same sample code on my computer, I've found that it is indeed often slower than a single-threaded iteration, and at the best of times barely faster (around 30-40% time savings while running on four CPU cores - not very efficient).

EDIT:

Willaien found a mention on MSDN about this very issue, and how to solve it - https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx. The idea is to use a custom partitioner and iterate over whole ranges in the Parallel.For body (i.e. a loop inside Parallel.For's loop; a sketch of this is shown below, after the AsParallel example). However, for most cases, using AsParallel is probably still the better choice - simple loop bodies usually mean some kind of map/reduce operation, and AsParallel and LINQ in general are great at that. For example, your sample code could be rewritten simply as:

var result = testData.AsParallel().Where(i => i == 1).ToList();

The only case where using AsParallel is a bad idea is the same as with LINQ in general - when your loop body has side effects. Some might be tolerable, but it's safer to avoid them altogether.
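For completeness, a minimal sketch of the partitioner approach from the MSDN article above might look like the following. Partitioner.Create hands each task a whole index range, and the body runs a plain for loop over it, so the per-task work is substantial even though the per-item work is tiny (resultBag and ranges are made-up names):

    // Partition the index space into ranges; each Parallel.ForEach body
    // receives one (fromInclusive, toExclusive) tuple at a time.
    var resultBag = new System.Collections.Concurrent.ConcurrentBag<int>();
    var ranges = System.Collections.Concurrent.Partitioner.Create(0, testData.Count());

    System.Threading.Tasks.Parallel.ForEach(ranges, range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            if (testData[i] == 1)
            {
                resultBag.Add(testData[i]);
            }
        }
    });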

Answered by Luaan


After some analysis, you're likely not even adding to these collections: 100,000,000 elements is still quite a bit smaller than the key search space (approx. 2.1 billion possible values), so the result collections will likely end up with no elements added, or just one or two.

As for the particular issue: while I'm able to replicate it, I'm unable to give a direct answer as to why it happens, but I suspect it has to do with heavy contention around the memory bus in some way, and with how Parallel handles partitioning and thread creation. Limiting the number of threads to the current processor count seems to help, but it doesn't solve the issue entirely.
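For reference, capping the loop at the processor count looks roughly like this (a sketch reusing the question's testData and resultData2; ParallelOptions and MaxDegreeOfParallelism are the standard knobs for this):

    // Cap the degree of parallelism at the number of logical processors.
    var options = new System.Threading.Tasks.ParallelOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };

    System.Threading.Tasks.Parallel.ForEach(testData, options, item =>
    {
        if (item.Equals(1))
        {
            resultData2.Push(item);
        }
    });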

All that said, a PLINQ version of things seems to be much faster and more consistent:

var resultData = testData.AsParallel().Where(x => x == 1).ToList();

Edit: It looks like this is a semi-obscure but known issue; more details are available here: https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx

Answered by willaien