Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

N-Body problen: Efficient parallelization of the double for loop

A very common problem on a N-body problem is the use of a double cycle to calculate the interactions between the particles. Considering a N body problem with n particles, the cycle can be written has

for (i = 0, i < n; i++)
    for (j = i+1, j < n; j++)
        // calculate interaction

My question is about how can this cycle be parallelized using different threads. The objective is that each thread would "ideally" have to calculate the same number of interactions.

My idea was to separate the outer cycle, the i-cycle, on different intervals, say a_k=a(k), where k = 1,2,...,p where p is the number of threads we want to divide the problem into.

So, the cycle could be written as

for (k = 1, k < p; k++)
    for (i = a(k), i < a(k+1); i++)
        for (j = i+1, j < n; j++)
            // calculate interaction

Where the most outer cycle, the k-cycle, is the one to be parallelized.

Because the number of interactions of the most inner cycle, the j-cycle, is n-(i+1), the number of interactions calculated by each thread is

\sum_{i=a(k)}^{a(k+1)} n - (i+1)

This means that one would like find the discrete function a_k such that the functional

f[a_k] = \sum_{i=a(k)}^{a(k+1)} n - (i+1)

with the boundary conditions a(1)=0 and a(p)=n is a constant functional, thus forcing the number of interactions on each thread to be the same.

I've tried using different "heuristics"(e.g. a_k polynomial, exponential, log), and so far none have gave me a satisfactory answer. A direct solution of this problem is not evident to me.

For small p, this problem can be put on the "minimization sack problems" where basically each a_k is a variable to minimize the function

f(a_1,a_2,a_3,...) = sum(|f[a_k] - n/p|^2)

But has you might guess, this is not efficient (or even converge) for higher values of p.

Does anyone have any idea of how could this problem be tackled?

like image 389
Jorge Leitao Avatar asked May 09 '12 22:05

Jorge Leitao


2 Answers

(Sorry if this is not expressed clearly, it makes sense in my head).

When adding up all the numbers from 1 to N, you can notice that N + 1 = (N - 1) + 2 = (N - 2) + 3, etc.

So, what if each thread used one small i and one large i, such that the sums always added up?

Or, say you wanted to always use 5 threads. Thread 1 would do the first 10% and the last 10%, thread 2 would do the second 10% and second-to-last 10%, and so on. Each pairing of an 'early' and 'late' section would add up to the same total number of interactions.

EDIT:

Stealing a diagram from another post...

   0 1 2 3 4 5 6 7 8

0  - A B C D D C B A
1    - B C D D C B A  
2      - C D D C B A
3        - D D C B A  
4          - D C B A
5            - C B A
6              - B A
7                - A
8                  -

Does that show more clearly what I mean?

like image 64
DGH Avatar answered Oct 19 '22 16:10

DGH


You can divide your objects into k groups of roughly N/k bodies, and use this to dissect your initial triangle of interactions into k*(k + 1)/2 pieces:

   0 1 2 3 4 5 6 7 8
                      -- N=9;  k=3;  N/k=3
0  - A A B B B C C C
1    - A B B B C C C  -- diagonal pieces:  A, D, F
2      - B B B C C C
3        - D D E E E  -- non-diagonal pieces: B, C, E
4          - D E E E
5            - E E E
6              - F F
7                - F
8                  -

This view is complicated by the fact that there are two kinds of pieces: those along the diagonal (which are triangles with (N/k)*(N/k - 1)/2 elements), and those which are not (which are squares with (N/k)*(N/k) elements). However, since the diagonal pieces are roughly half the size of the square pieces, you can assign two to each thread to balance the load -- for a total of k*k/2 roughly-equal tasks.

An advantage of this method is that each task only needs to access the data for 2*N/k bodies, which could make it significantly more cache-friendly.

like image 3
comingstorm Avatar answered Oct 19 '22 17:10

comingstorm