Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.
for (int j = 0; j < 8000; j++) {
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}
This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.
I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like
for (int j = 0; j < 500; j++) {
    for (int i = 0; i < 16; i++) {
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++) {
        // Add 'a[i]' to 'b' coefficient wise
    }
}
I use OpenMP on each of the inner for loops by adding '#pragma omp parallel for' before each one. All of my processors are in use, but my runtime increases significantly instead of decreasing. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank you in advance.
OpenMP creates threads for your program wherever you insert the pragma, so it is creating threads for the inner loops. The problem is that with your method 16 threads are created, each one does a single operation, and then all of them are destroyed. Creating and destroying threads takes a lot of time, so the method you used increases the overall time of your process even though it uses all 16 cores. You didn't have to create the inner for loops; just put the #pragma omp parallel for tag before your 8000-iteration loop. It's up to OpenMP to divide the iterations among the threads, so what you did by creating the second loop is OpenMP's job. That way OpenMP creates the threads only once, each thread processes 500 of the iterations, and all of them end after that, instead of threads being created and destroyed on every one of the 500 outer iterations.
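A minimal sketch of that suggestion follows. It is not the asker's code: compute_a is a hypothetical stand-in for the real operation, and the name accumulate_parallel is made up for illustration. One caveat the sketch assumes matters here: if every thread adds into the same 'b', the updates race with each other, so each thread gets its own 'a' and its own partial sum, and the partial sums are merged into 'b' once at the end.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "perform an operation and store in the vector 'a'".
static void compute_a(int j, std::vector<double>& a) {
    for (std::size_t k = 0; k < a.size(); ++k)
        a[k] = static_cast<double>(j) + static_cast<double>(k);
}

// Adds the result of every iteration into b, using a single parallel region.
void accumulate_parallel(std::vector<double>& b, int iterations = 8000) {
    const std::size_t n = b.size();

    #pragma omp parallel
    {
        // Each thread owns its scratch vector and its partial sum,
        // so no two threads write to the same memory inside the loop.
        std::vector<double> a(n);
        std::vector<double> local_b(n, 0.0);

        // OpenMP splits the iterations among the threads of the region.
        #pragma omp for
        for (int j = 0; j < iterations; ++j) {
            compute_a(j, a);
            for (std::size_t k = 0; k < n; ++k)
                local_b[k] += a[k];
        }

        // Merge the partial sums into b, one thread at a time.
        #pragma omp critical
        for (std::size_t k = 0; k < n; ++k)
            b[k] += local_b[k];
    }
}
```

With GCC this would be built with -fopenmp; without that flag the pragmas are ignored and the function runs serially with the same result. Whether it is actually faster depends on how expensive the real operation is compared with the final merge.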
Actually, I am going to put these comments in an answer.
Forking threads for trivial operations just adds overhead.
First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE intrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)
Assuming your compiler is a reasonably modern GCC, invoke it with:
gcc -O3 -march=native ...
Add -ftree-vectorizer-verbose=2 to find out whether or not it auto-vectorized your loop and why. (Newer GCC releases deprecate this flag in favor of -fopt-info-vec and -fopt-info-vec-missed.)
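As a rough illustration, the coefficient-wise addition could be written like this to give the auto-vectorizer an easy job. The function name add_into is hypothetical, and __restrict is a compiler extension (accepted by GCC, Clang and MSVC) used here on the assumption that 'a' and 'b' never overlap in memory.

```cpp
#include <cstddef>

// Coefficient-wise b += a over n elements.
// __restrict promises the compiler that a and b do not alias,
// which removes one common obstacle to auto-vectorization.
void add_into(double* __restrict b, const double* __restrict a, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        b[k] += a[k];
}
```

Compiling this file with the flags above and the vectorizer report enabled should show whether the loop was vectorized on your particular compiler and target.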
If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.
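One possible shape for such a restructuring is sketched below. It assumes, which may not hold for the real operation, that each element of 'a' can be computed independently, so the work can be done one block of 'b' at a time. The names a_element and accumulate_blocked and the default block size are illustrative only; 2048 doubles is 16 KiB, chosen on the assumption of a 32 KiB L1 data cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical: element k of the vector 'a' produced in iteration j.
static inline double a_element(int j, std::size_t k) {
    return static_cast<double>(j) + static_cast<double>(k);
}

// Adds all iterations into b one block at a time, so each block of b
// stays in cache while every iteration's contribution is applied to it.
void accumulate_blocked(std::vector<double>& b, int iterations,
                        std::size_t block = 2048) {
    const std::size_t n = b.size();
    for (std::size_t start = 0; start < n; start += block) {
        const std::size_t end = std::min(n, start + block);
        for (int j = 0; j < iterations; ++j)
            for (std::size_t k = start; k < end; ++k)
                b[k] += a_element(j, k);
    }
}
```

Compared with streaming the whole of 'b' through memory on every iteration, this loads each block of 'b' once and reuses it for all iterations, which is the kind of reuse that helps when memory bandwidth is the limit.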