
Why does this parallel code in D scale so badly?

Here is one experiment I performed comparing parallelism in C++ and D. I implemented the same algorithm (a parallel label propagation scheme for community detection in networks) in both languages, using the same design: a parallel iterator receives a handle function (usually a closure) and applies it to every node of the graph.

Here is the iterator in D, implemented using taskPool from std.parallelism:

/**
 * Iterate in parallel over all nodes of the graph and call handler (lambda closure).
 */
void parallelForNodes(F)(F handle) {
    // z: the number of nodes in the graph
    foreach (node v; taskPool.parallel(std.range.iota(z))) {
        handle(v); // handle is called once per node
    }
}

And this is the handle function which is passed:

    auto propagateLabels = (node v){
        if (active[v] && (G.degree(v) > 0)) {
            integer[label] labelCounts;

            G.forNeighborsOf(v, (node w) {
                label lw = labels[w];
                labelCounts[lw] += 1; // add weight of edge {v, w}
            });

            // get dominant label
            label dominant;
            integer lcmax = 0;
            foreach (label l, integer lc; labelCounts) {
                if (lc > lcmax) {
                    dominant = l;
                    lcmax = lc;
                }
            }

            if (labels[v] != dominant) { // UPDATE
                labels[v] = dominant;
                nUpdated += 1; // TODO: atomic update?
                G.forNeighborsOf(v, (node u) {
                    active[u] = 1;
                });
            } else {
                active[v] = 0;
            }
        }
    };

The C++11 implementation is almost identical, but uses OpenMP for parallelization. So what do the scaling experiments show?

(figure: weak-scaling running times of the C++ and D implementations)

Here I examine weak scaling: I double the input graph size while also doubling the number of threads and measure the running time. The ideal would be a flat (constant-time) curve, but of course there is some overhead for parallelism. I use defaultPoolThreads(nThreads) in my main function to set the number of threads for the D program. The curve for C++ looks good, but the curve for D looks surprisingly bad. Am I doing something wrong w.r.t. D parallelism, or does this reflect badly on the scalability of parallel D programs?

p.s. compiler flags

for D: rdmd -release -O -inline -noboundscheck

for C++: -std=c++11 -fopenmp -O3 -DNDEBUG

pps. Something must be really wrong, because the D implementation is slower in parallel than sequentially:

(figure: running times of the D implementation, parallel vs. sequential)

ppps. For the curious, here are the Mercurial clone URLs for both implementations:

  • http://algohub.iti.kit.edu/parco/Prototypes/PLPd
  • http://algohub.iti.kit.edu/parco/Prototypes/PLPcpp
asked Jul 30 '13 by clstaudt

2 Answers

As Peter Alexander points out, your algorithm appears to be thread-unsafe. To make it thread-safe, you need to eliminate all data dependencies between events that may occur in different threads simultaneously or in an undefined order. One way to do this is to replicate some state across threads using WorkerLocalStorage (provided in std.parallelism), and possibly merge the results in a relatively cheap loop at the end of your algorithm.

In some cases the process of replicating this state can be automated by writing the algorithm as a reduction and using std.parallelism.reduce (possibly in combination with std.algorithm.map or std.parallelism.map) to do the heavy lifting.

answered Oct 18 '22 by dsimcha


It's hard to say because I don't fully understand how your algorithm is supposed to work, but it looks like your code is not thread-safe, which is causing the algorithm to run more iterations than necessary.

I added this to the end of PLP.run:

writeln(nIterations);

With 1 thread nIterations = 19
With 10 threads nIterations = 34
With 100 threads nIterations = 90

So as you can see, it's taking longer not because of some issues with std.parallelism, but simply because it's doing more work.

Why is your code not thread-safe?

The function you run in parallel is propagateLabels, which has shared, unsynchronised access to labels, nUpdated, and active. Who knows what bizarre behaviour this is causing, but it can't be good.

Before you start profiling, you need to fix the algorithm to be thread-safe.

answered Oct 17 '22 by Peter Alexander