
FFTW plan creation using OpenMP

I am trying to perform several FFTs in parallel. I am using FFTW and OpenMP. Each FFT is different, so I'm not relying on FFTW's built-in multithreading (which I know uses OpenMP).

int m;

// assume:
// int numberOfColumns = 100;
// int numberOfRows = 100;

#pragma omp parallel for default(none) private(m) shared(numberOfColumns, numberOfRows)//  num_threads(4)
    for(m = 0; m < 36; m++){

        // create pointers
        double          *inputTest;
        fftw_complex    *outputTest;
        fftw_plan       testPlan;

        // preallocate vectors for FFTW
        outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
        inputTest  = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);

        // confirm that preallocation worked
        if (inputTest == NULL || outputTest == NULL){
            logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
        }

        // EDIT: insert data into inputTest
        inputTest = someDataSpecificToThisIteration(m); // same size for all m

        // create FFTW plan
        #pragma omp critical (make_plan)
        {
            testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
        }

        // confirm that plan was created correctly
        if (testPlan == NULL){
            logger_.log_error("\t\t failed to create plan on m = %i", m);
        }

        // execute plan
        fftw_execute(testPlan);

        // clean up
        fftw_free(inputTest);
        fftw_free(outputTest);
        fftw_destroy_plan(testPlan);

    }// end parallelized for loop

This all works fine. However, if I remove the critical construct from around the plan creation (fftw_plan_dft_r2c_2d) my code will fail. Can someone explain why? fftw_plan_dft_r2c_2d isn't really an "orphan", right? Is it because two threads might both try to hit the numberOfRows or numberOfColumns memory location at the same time?

Asked by tir38 on Feb 21 '13 at 20:02


1 Answer

It's pretty much all written in the FFTW documentation about thread safety:

... but some care must be taken because the planner routines share data (e.g. wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe (re-entrant) routine in FFTW is fftw_execute (and the new-array variants thereof). All other routines (e.g. the planner) should only be called from one thread at a time. So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread. We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.
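
To make the "shared data" point concrete, here is a minimal toy sketch of why a routine that lazily builds shared state is not re-entrant. The names `shared_cache`, `toy_make_plan` and `plan_in_parallel` are hypothetical stand-ins invented for this illustration; FFTW's real planner is different and far more involved. The sketch only shows the general failure mode and how a named critical section avoids it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for planner state shared between calls
   (this is NOT how FFTW stores wisdom or trigonometric tables). */
typedef struct {
    int     size;   /* number of cached entries */
    double *table;  /* lazily built lookup table, reused by later calls */
} shared_cache;

static shared_cache cache = {0, NULL};

/* Toy "planner": (re)builds the shared table when needed, then reads it.
   Not re-entrant: two unsynchronised threads could both see a stale table,
   both rebuild it, and one could free memory the other is still reading. */
static double toy_make_plan(int n)
{
    if (cache.table == NULL || cache.size < n) {
        double *t = malloc((size_t)n * sizeof *t);
        assert(t != NULL);
        for (int i = 0; i < n; i++)
            t[i] = (double)i;
        free(cache.table);
        cache.table = t;
        cache.size = n;
    }
    return cache.table[n - 1];
}

/* Calls the toy planner from many threads, one caller at a time.
   Returns the number of calls that produced the expected value. */
int plan_in_parallel(int calls, int n)
{
    int ok = 0;
    #pragma omp parallel for reduction(+:ok)
    for (int m = 0; m < calls; m++) {
        double last;
        /* Serialise every access to the shared cache. */
        #pragma omp critical (make_plan)
        last = toy_make_plan(n);
        if (last == (double)(n - 1))
            ok++;
    }
    return ok;
}
```

With the critical section in place, `plan_in_parallel(36, 1024)` should report 36 successful calls. Removing the pragma turns the table rebuild into a data race, which is the same class of problem the documentation warns about for the real planner.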

In a typical FFT application, plans are created rarely, so having to synchronise their creation costs very little. In your case there is no need to create a new plan on every iteration unless the dimensions of the data change. Instead, you could do the following:

#pragma omp parallel default(none) private(m) shared(numberOfColumns, numberOfRows)
{
   // create pointers
   double          *inputTest;
   fftw_complex    *outputTest;
   fftw_plan       testPlan;

   // preallocate vectors for FFTW
   outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
   inputTest  = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);

   // confirm that preallocation worked (m is not set yet here, so it is not logged)
   if (inputTest == NULL || outputTest == NULL){
      logger_.log_error("\t\t FFTW memory not allocated");
   }

   // create FFTW plan
   #pragma omp critical (make_plan)
   testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);

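   #pragma omp for
   for (m = 0; m < 36; m++) {
      // copy the data for iteration m into inputTest here; do not reassign
      // the pointer, because the plan is bound to this buffer

      // execute plan
      fftw_execute(testPlan);
   }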

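   // clean up; per the FFTW docs only fftw_execute is re-entrant,
   // so serialise plan destruction just like plan creation
   #pragma omp critical (make_plan)
   fftw_destroy_plan(testPlan);

   fftw_free(inputTest);
   fftw_free(outputTest);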
}

Now each thread creates its plan only once, so the cost of serialising plan creation is amortised over all subsequent calls to fftw_execute(). If running on a NUMA system (e.g. a multi-socket AMD64 or Intel (post-)Nehalem system), you should also enable thread binding in order to achieve maximum performance.

Answered by Hristo Iliev on Nov 12 '22 at 12:11