
OpenMP For - group loops for cache optimization

I am working to adapt a program to use OpenMP. I have a group of nested for loops. The outermost loop iterates over the y-axis, going down an image. I would like to run multiple threads in parallel on this loop, but I'm having trouble making it fast.

Currently when I run 8 threads it runs like:

thread 0 -> row 0,8,16...
thread 1 -> row 1,9,17...
thread 2 -> row 2,10,18...
thread 3 -> row 3,11,19...

I would like it to run in blocks, so that thread 0 does the first 1/8 of the rows. What is the best way to do this?

Current code:

...
int y_percent = data_size_Y/8;
int thread = 0;

#pragma omp parallel for num_threads(8) firstprivate(vecs, bufferedOut,data_size_X, data_size_Y, kern_cent_X, kern_cent_Y, sum)
for(int y = y_percent*omp_get_thread_num(); y < (omp_get_thread_num()+1)*y_percent; y++){ // the y coordinate of the output location we're focusing on
Marcus asked Apr 10 '26 07:04


2 Answers

You can use the schedule clause on the pragma to control how iterations are divided among threads. In the example below I use static scheduling with a chunk size, which is the number of contiguous iterations in each chunk; chunks are handed to threads round-robin in thread order. With a chunk size of 8, thread 0 gets iterations 0-7, thread 1 gets iterations 8-15, and so on. For your case, a chunk size of about data_size_Y/8 (rounded up) would give each of the 8 threads one contiguous block of rows.

If you don't care which thread gets which chunk (e.g. whether thread 0 gets the first chunk), you can replace static with dynamic. With dynamic, chunks are assigned to threads as they become free instead of being preassigned at the start, which helps load balancing when some iterations take longer than others. For more information on the scheduling methods, check out the following:

  • Wikipedia article - Scheduling Clauses
  • LLNL docs - DO/for Directive

Example:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int iterations = 32;
  int num_threads = 4;

  /* static schedule, chunk size 8: each thread gets 8 contiguous iterations */
#pragma omp parallel for schedule(static, 8) num_threads(num_threads)
  for (int i = 0; i < iterations; i++) {
    printf("thread %d: %d\n", omp_get_thread_num(), i);
  }

  return 0;
}
codehathi answered Apr 11 '26 20:04


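You can get that with a plain parallel for and no schedule clause. The default schedule is implementation-defined, but common compilers such as GCC use static scheduling with no chunk size, which splits the rows into roughly equal contiguous blocks, one per thread. Adding schedule(static) makes that explicit if you want to be sure.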

#pragma omp parallel for num_threads(8)
for(int y = 0; y < data_size_Y; y++) {
    ....
}

In general, I don't think the long firstprivate list is necessary. Depending on exactly how you use those variables, most of them can probably be shared: anything the loop only reads is safe to share, and only values that each thread modifies need to be private.

kangshiyin answered Apr 11 '26 20:04


