Recently, I had answered a question about optimizing a likely parallelizable method for generation every permutation of arbitrary base numbers. I posted an answer similar to the Parallelized, poor implementation code block list, and someone nearly immediately pointed this out:
This is pretty much guaranteed to give you false sharing and will probably be many times slower. (credit to gjvdkamp)
and they were right, it was death slow. That said, I researched the topic, and found some interesting material and suggestions (archived MSDN magazine only, .NET Matters: False Sharing) for combating it. If I understand it correctly, when threads access contiguous memory (in say, the array that's likely backing that ConcurrentStack
), false sharing likely occurs.
For code below the horizontal rule, a Bytes
is:
struct Bytes {
public byte A; public byte B; public byte C; public byte D;
public byte E; public byte F; public byte G; public byte H;
}
For my own testing, I wanted to get a parallel version of this running and be genuinely faster, so I created a simple example based on the original code. 6
as limits[0]
was a lazy choice on my part - my computer has 6 cores.
Single thread block Average run time: 10s0059ms
var data = new List<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };
for (byte a = 0; a < limits[0]; a++)
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
data.Add(new Bytes {
A = a, B = b, C = c, D = d,
E = e, F = f, G = g, H = h
});
Parallelized, poor implementation Run time avg: 81s729ms, ~ 8700 contentions
var data = new ConcurrentStack<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };
Parallel.For(0, limits[0], (a) => {
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
data.Push(new Bytes {
A = (byte)a,B = b,C = c,D = d,
E = e,F = f,G = g,H = h
});
});
Parallelized, ?? implementation Run time avg: 5s833ms, 92 contentions
var data = new ConcurrentStack<List<Bytes>>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };
Parallel.For (0, limits[0], () => new List<Bytes>(),
(a, loop, localList) => {
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
localList.Add(new Bytes {
A = (byte)a, B = b, C = c, D = d,
E = e, F = f, G = g, H = h
});
return localList;
}, x => {
data.Push(x);
});
I'm glad that I had got an implementation that is faster than the single threaded version. I expected a result closer to around 10s / 6, or around 1.6 seconds, but that's probably a naive expectation.
My question is for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that could be a applied to the operation? I'm wondering about optimizations related to parallelization, not improvements to the algorithm used to compute the values. Specifically:
struct
instead of byte[]
, but it's not related to parallelization (or is it?)struct
optimization. First off, my initial assumption regarding Parallel.For()
and Parallel.ForEach()
was wrong.
The poor parallel implementation very likely has 6 threads all attempting to write to a single CouncurrentStack()
at once. The good implementation usuing thread locals (explained more below) only accesses the shared variable once per task, nearly eliminating any contention.
When using Parallel.For()
and Parallel.ForEach()
, you cannot simply in-line replace a for
or foreach
loop with them. That's not to say it couldn't be a blind improvement, but without examining the problem and instrumenting it, using them is throwing multithreading at a problem because it might make it faster.
**Parallel.For()
and Parallel.ForEach()
has overloads that allow you to create a local state for the Task
they ultimately create, and run an expression before and after each iteration's execution.
If you have an operation you parallelize with Parallel.For()
or Parallel.ForEach()
, it's likely a good idea to use this overload:
public static ParallelLoopResult For<TLocal>(
int fromInclusive,
int toExclusive,
Func<TLocal> localInit,
Func<int, ParallelLoopState, TLocal, TLocal> body,
Action<TLocal> localFinally
)
For example, calling For()
to sum all integers from 1 to 100,
var total = 0;
Parallel.For(0, 101, () => 0, // <-- localInit
(i, state, localTotal) => { // <-- body
localTotal += i;
return localTotal;
}, localTotal => { <-- localFinally
Interlocked.Add(ref total, localTotal);
});
Console.WriteLine(total);
localInit
should be an lambda that initializes the local state type, which is passed to the body
and localFinally
lambdas. Please note I am not recommending implementing summing 1 to 100 using parallelization, but just have a simple example to make the example short.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With