I am writing a C++ program that does a brute-force search for closed Knight's tours. The code is here.
I would like to parallelize this using OpenMP. My problem is to do this in a way that creates a sufficient degree of parallelism. Currently the relevant portion of my code looks like this
#pragma omp parallel for reduction(+:count) if (depth==4)
  for (size_t i=0;i<g.neighbours[last].size();i++){
    auto n = g.neighbours[last][i];
    // See if n can be used to extend or complete the tour
The if (depth==4) is my attempt to make sure that not too many parallel tasks are created but on the other hand enough are created to keep all processors busy. Setting depth==2 does not change the runtime of the program.
This does not seem to be working out. For a 3x12 problem, on my dual-core processor the total CPU time consumed by the OpenMP version is around 130s whereas a single-threaded version without OpenMP takes around 40s of CPU time.
I would appreciate suggestions on how better to use OpenMP or reasons why it is unsuitable for this problem.
UPDATE: Thanks to @Zulan I have a newer version using OpenMP tasks with a much faster sequential performance as well as good parallelization.
First, lets take a super-quick look at where your application spends its time, by using perf:
perf record ./knights_tour_omp 3 12
perf report
# Overhead  Command          Shared Object        Symbol                                                                                         
# ........  ...............  ...................  .................................................................................................
#
    53.29%  knights_tour_om  libc-2.19.so         [.] malloc                                                                                       
    23.16%  knights_tour_om  libc-2.19.so         [.] _int_free                                                                                    
    10.31%  knights_tour_om  libc-2.19.so         [.] _int_malloc                                                                                  
     4.78%  knights_tour_om  knights_tour_omp     [.] _Z13continue_tourIZ4mainEUlRKSt6vectorIiSaIiEEE_EmRK5GraphS4_RKS0_IcSaIcEEiiRT_.constprop.119
     2.64%  knights_tour_om  libc-2.19.so         [.] __memmove_ssse3_back                                                                         
     1.48%  knights_tour_om  libc-2.19.so         [.] malloc_consolidate                                                                 
You application spends all it's time allocating and freeing memory. While there are some report that malloc is is not fully locked, it doesn't seem to parallelize nicely either.
It doesn't need much further investigation to figure out that this is caused by copying the visited and path vectors for each iteration. Fortunately DFS has the nice property, that you don't necessarily need to do that, you can reuse the state and restore it:
visited[n] = 1;
path.emplace_back(n);
count += continue_tour(g, path, visited, depth+1,remain-1, cb);
visited[n] = 0;
path.pop_back();
However, for the parallel loop, you need to make working copies for each thread, so you need to make a separate path for this case with the original behavior.
This small change already brings down the serial runtime from 22 s to 2.65 s. Further with 2 threads it goes down to 2.0 s. There is a speedup, but it's not so good - why? To illustrate that, I used 4 threads (on a large-enough system). Here is the execution of the application over time recorded with score-p, shown in Vampir.

Red is actual work, blue is a implicit barrier, meaning the threads are idling. There always seem to be either 3 or 1 active thread. The reason is simple: Most of the position on the board have either 4 or 2 neighbors, but one of them is already visited so this loop iteration completes instantly. So the actual work is in either 3 or 1 iteration of the loop. That is a particularly bad match for 2 threads. With 3 threads, the runtime goes down to 1.7 s
Note: All times with gcc 5.3 on an E5-2690 @ 2.90 GHz no turbo. No care was taken to compensate for variance in runtimes.
Considering that a single loop exposes very bad parallelism for that problem, you might be tempted to nest parallel loops. I encourage you to try it, but I don't think that will work out well. I would think that tasks work well in this context, because tasks can spawn other tasks. So you can spawn each recursive step as a task and let the OMP runtime figure out a good scheduling. But make sure to limit it up to a certain depth, so that tasks do not get too short.
I want to stress the importance of using tools: Trying to figure out performance issues without performance analysis tools is like figuring out bugs without a debugger. perf is readily available for Linux and in it's basic form extremely simple to use. Yet it is very powerful (although advanced usage can have some pitfalls).
An additional remark: With OpenMP it is often worthwile to try different compilers. For instance with the (fixed) tasking code, gcc does not scale anymore beyond 4 threads, while the intel compiler / omp runtime provides a speedup even up to 24 threads.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With