Here is one experiment I performed comparing parallelism in C++ and D. I implemented the same algorithm (a parallel label propagation scheme for community detection in networks) in both languages, using the same design: a parallel iterator takes a handle function (usually a closure) and applies it to every node in the graph.
Here is the iterator in D, implemented using taskPool from std.parallelism:
/**
 * Iterate in parallel over all nodes of the graph and call the handler (a lambda closure).
 */
void parallelForNodes(F)(F handle) {
    // z is the number of nodes in the graph
    foreach (node v; taskPool.parallel(std.range.iota(z))) {
        handle(v); // apply the handler to each node
    }
}
And this is the handle function that is passed:
auto propagateLabels = (node v) {
    if (active[v] && (G.degree(v) > 0)) {
        integer[label] labelCounts;
        G.forNeighborsOf(v, (node w) {
            label lw = labels[w];
            labelCounts[lw] += 1; // add weight of edge {v, w}
        });
        // get dominant label
        label dominant;
        integer lcmax = 0;
        foreach (label l, integer lc; labelCounts) {
            if (lc > lcmax) {
                dominant = l;
                lcmax = lc;
            }
        }
        if (labels[v] != dominant) { // UPDATE
            labels[v] = dominant;
            nUpdated += 1; // TODO: atomic update?
            G.forNeighborsOf(v, (node u) {
                active[u] = 1;
            });
        } else {
            active[v] = 0;
        }
    }
};
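The two are then combined with a call along these lines:

    parallelForNodes(propagateLabels);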
The C++11 implementation is almost identical, but uses OpenMP for parallelization. So what do the scaling experiments show?
Here I examine weak scaling: I double the input graph size while also doubling the number of threads and measure the running time. The ideal would be a flat line (constant running time), but of course there is some overhead for parallelism. I use defaultPoolThreads(nThreads) in my main function (sketched below) to set the number of threads for the D program. The curve for C++ looks good, but the curve for D looks surprisingly bad. Am I doing something wrong w.r.t. D parallelism, or does this reflect badly on the scalability of parallel D programs?
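In outline, that setup looks something like this (a minimal sketch; the command-line argument handling is made up for illustration):

    import std.conv : to;
    import std.parallelism : defaultPoolThreads;

    void main(string[] args) {
        // hypothetical: read the thread count from the command line
        uint nThreads = (args.length > 1) ? to!uint(args[1]) : 1;
        // only takes effect if called before taskPool is first used
        defaultPoolThreads(nThreads);
        // ... set up the graph and run the algorithm ...
    }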
P.S. Compiler flags:
for D: rdmd -release -O -inline -noboundscheck
for C++: -std=c++11 -fopenmp -O3 -DNDEBUG
P.P.S. Something must be really wrong, because the D implementation is slower in parallel than it is sequentially.
P.P.P.S. For the curious, here are the Mercurial clone URLs for both implementations:
As Peter Alexander points out, your algorithm appears to be thread-unsafe. To make it thread-safe, you need to eliminate all data dependencies between events that may occur in different threads simultaneously or in an undefined order. One way to do this is to replicate some state across threads using WorkerLocalStorage (provided in std.parallelism), and possibly merge the results in a relatively cheap loop at the end of your algorithm.
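For example, the shared nUpdated counter could be kept per worker thread and summed at the end. A minimal sketch, assuming nodes are the integer indices 0 .. z as in your code:

    import std.parallelism : taskPool;
    import std.range : iota;

    void runIteration(size_t z) { // z: number of nodes, as in the question
        // one counter per worker thread, initialised to zero
        auto localUpdated = taskPool.workerLocalStorage!size_t(0);

        foreach (v; taskPool.parallel(iota(z))) {
            // ... per-node label propagation work for node v ...
            localUpdated.get += 1; // thread-local, so no synchronisation needed
        }

        // merge the per-thread counters in a cheap sequential loop;
        // toRange must only be used once all parallel work is done
        size_t nUpdated = 0;
        foreach (count; localUpdated.toRange)
            nUpdated += count;
    }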
In some cases the process of replicating this state can be automated by writing the algorithm as a reduction and using std.parallelism.reduce (possibly in combination with std.algorithm.map or std.parallelism.map) to do the heavy lifting.
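For instance, the update count could be phrased as a map/reduce (wouldUpdate is a made-up predicate for illustration, not part of your code):

    import std.algorithm : map;
    import std.parallelism : taskPool;
    import std.range : iota;

    // hypothetical: returns 1 if node v would change its label this round, else 0
    size_t wouldUpdate(size_t v) { /* ... */ return 0; }

    size_t countUpdates(size_t z) {
        // taskPool.reduce splits the range across the workers and sums the parts
        return taskPool.reduce!"a + b"(iota(z).map!wouldUpdate);
    }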
It's hard to say because I don't fully understand how your algorithm is supposed to work, but it looks like your code is not thread-safe, which is causing the algorithm to run more iterations than necessary.
I added this to the end of PLP.run:
writeln(nIterations);
With 1 thread nIterations = 19
With 10 threads nIterations = 34
With 100 threads nIterations = 90
So as you can see, it's taking longer not because of some issue with std.parallelism, but simply because it's doing more work.
Why is your code not thread-safe?
The function you run in parallel is propagateLabels, which has shared, unsynchronised access to labels, nUpdated, and active. Who knows what bizarre behaviour this is causing, but it can't be good.
Before you start profiling, you need to fix the algorithm to be thread-safe.
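As a first step (this fixes only the counter; the races on labels and active need their own treatment), the nUpdated increment could be made atomic:

    import core.atomic : atomicOp;

    shared size_t nUpdated; // must be declared shared for atomic access

    void recordUpdate() {
        // replaces the unsynchronised `nUpdated += 1` in propagateLabels
        atomicOp!"+="(nUpdated, 1); // lock-free atomic increment
    }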