Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallel Sum for Vectors

Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.

for (int j = 0; j < 8000; j++){
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}

This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.

I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like

for (int j = 0; j < 500; j++){
    for (int i = 0; i < 16; i++){
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++){
        // Add 'a[i]' to 'b' coefficient wise
    }

}

I use the OpenMp on each of the for loops inside by adding '#pragma omp parallel for' before each of the inner loops. All of my processors are in use but my runtime only increases significantly. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank You in Advance.

like image 902
A-A Avatar asked Jun 05 '11 05:06

A-A


2 Answers

omp creates threads for your program whereever you insert pragma tag, so it's createing threads for inner tags but the problem is 16 threads are created, each one does 1 operation and then all of them are destroyed using your method. creating and destroying threads take a lot of time so the method you used increases the overal time of your process although it uses all 16 cores. you didn't have to create inner fors just put #pragma omp parallel for tag before your 8000 loop it's up to omp to seperate values between treads so what you did to create the second loop, is omp's job. that way omp create threads only once and then process 500 numbers useing that each thread and end all of them after that (using 499 less thread creation and destruction)

like image 136
Ali1S232 Avatar answered Nov 19 '22 05:11

Ali1S232


Actually, I am going to put these comments in an answer.

Forking threads for trivial operations just adds overhead.

First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE instrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)

Assuming your compiler is a reasonably modern GCC, invoke it with:

gcc -O3 -march=native ...

Add -ftree-vectorizer-verbose=2 to find out whether or not it auto-vectorized your loop and why.

If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.

like image 3
Nemo Avatar answered Nov 19 '22 06:11

Nemo