Improving the speed of FFT Implementation

Tags: c++, fft

I'm a beginner in programming and am currently working on a project that requires a Fast Fourier Transform implementation.

So far I have managed to implement the code below. Does anyone have alternatives or suggestions to improve its speed without losing accuracy?

#include <cmath> /* for sqrt() */

short FFTMethod::FFTcalc(short int dir, long m, double *x, double *y)
{
long n,i,i1,j,k,i2,l,l1,l2;
double c1,c2,tx,ty,t1,t2,u1,u2,z;

/* Calculate the number of points */
n = 1;
for (i=0;i<m;i++) 
    n *= 2;

/* Do the bit reversal */
i2 = n >> 1;
j = 0;
for (i=0;i<n-1;i++) {
  if (i < j) {
     tx = x[i];
     ty = y[i];
     x[i] = x[j];
     y[i] = y[j];
     x[j] = tx;
     y[j] = ty;
  }
  k = i2;
  while (k <= j) {
     j -= k;
     k >>= 1;
  }
  j += k;
}

/* Compute the FFT */
c1 = -1.0; 
c2 = 0.0;
l2 = 1;
for (l=0;l<m;l++) {
   l1 = l2;
   l2 <<= 1;
   u1 = 1.0; 
   u2 = 0.0;
   for (j=0;j<l1;j++) {
     for (i=j;i<n;i+=l2) {
        i1 = i + l1;
        t1 = u1 * x[i1] - u2 * y[i1];
        t2 = u1 * y[i1] + u2 * x[i1];
        x[i1] = x[i] - t1; 
        y[i1] = y[i] - t2;
        x[i] += t1;
        y[i] += t2;
     }
     z =  u1 * c1 - u2 * c2;
     u2 = u1 * c2 + u2 * c1;
     u1 = z;
   }
   c2 = sqrt((1.0 - c1) / 2.0);
   if (dir == 1)
      c2 = -c2;
   /* note: this line is deliberately outside the if above */
   c1 = sqrt((1.0 + c1) / 2.0);
}

/* Scaling for forward transform */
if (dir == 1) {
   for (i=0;i<n;i++) {
      x[i] /= n;
      y[i] /= n;
   }
}

return(1);
}
asked Dec 21 '11 by sagarn

2 Answers

I recently found an excellent PDF on the construction of high-performance FFTs by Eric Postpischil. Having developed several FFTs myself, I know how hard it is to compete with commercial libraries. Believe me, you're doing well if your FFT is only 4x slower than Intel or FFTW, not 40x! You can compete, though, and here's how.

To summarise that article: radix-2 FFTs are simple but inefficient, and the most efficient construct is the radix-4 FFT. Radix-8 is more efficient still, but it often does not fit into the registers of a CPU, so radix-4 is usually preferred.

FFTs can be constructed in stages, so to compute a 1024-point FFT you could perform 10 stages of the radix-2 FFT (since 2^10 = 1024) or 5 stages of the radix-4 FFT (since 4^5 = 1024). You could even compute a 1024-point FFT in stages of 8*4*4*4*2 if you so chose. Fewer stages mean fewer reads and writes to memory (memory bandwidth is the bottleneck for FFT performance), so dynamically choosing radix 4, 8 or higher is a must. The radix-4 stage is particularly efficient, as all its weights are 1+0i, 0+1i, -1+0i and 0-1i, and radix-4 butterfly code can be written to fit entirely in the cache.
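
To make the point about trivial weights concrete, here is a minimal sketch of one forward radix-4 decimation-in-time butterfly (my own illustration, not code from the article; inputs are assumed already twiddle-multiplied). The ±i factors reduce to swaps and sign flips, with no real multiplications at all:

#include <complex>

/* One forward radix-4 butterfly (decimation in time). Inputs a..d are
   assumed already multiplied by their twiddle factors. Multiplying by -i
   is just a swap-and-negate, so this stage needs no real multiplies. */
static void radix4Butterfly(std::complex<double> &a, std::complex<double> &b,
                            std::complex<double> &c, std::complex<double> &d)
{
    const std::complex<double> t0 = a + c;
    const std::complex<double> t1 = a - c;
    const std::complex<double> t2 = b + d;
    const std::complex<double> t3 = b - d;
    const std::complex<double> t3j(t3.imag(), -t3.real()); /* -i * t3 */
    a = t0 + t2;  /* X0 = x0 + x1 + x2 + x3     */
    b = t1 + t3j; /* X1 = x0 - i*x1 - x2 + i*x3 */
    c = t0 - t2;  /* X2 = x0 - x1 + x2 - x3     */
    d = t1 - t3j; /* X3 = x0 + i*x1 - x2 - i*x3 */
}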

Secondly, not every stage of the FFT is the same. In the first stage the weights are all equal to 1+0i, so there is no point computing this weight or even multiplying by it, since that is a complex multiplication by 1; the first stage may therefore be performed without weights. The final stage may also be treated differently and can be used to perform the decimation in time (bit reversal). Eric Postpischil's document covers all of this.
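
In the code in the question, for instance, the first pass of the outer loop (l == 0) is exactly this case: u1 is 1 and u2 is 0, so the multiplies are wasted. A weightless first stage could look like this sketch (a hypothetical helper operating on the same x/y arrays as FFTcalc):

/* Sketch: first radix-2 stage with weight 1+0i; pure adds/subtracts. */
static void firstStageNoWeights(double *x, double *y, long n)
{
    for (long i = 0; i < n; i += 2) {
        const double tr = x[i + 1], ti = y[i + 1];
        x[i + 1] = x[i] - tr;
        y[i + 1] = y[i] - ti;
        x[i] += tr;
        y[i] += ti;
    }
}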

The weights may be precomputed and stored in a table. Sin/cos calculations take around 100-150 cycles each on x86 hardware, so precomputing them can save 10-20% of the overall compute time, since a memory access is in this case faster than the CPU calculation. Using fast algorithms that compute sin and cos in one go is particularly beneficial. (Note that cos is equal to ±sqrt(1.0 - sine*sine), with the sign depending on the quadrant, or, using table lookups, cos is just a phase shift of sine.)
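
As a rough sketch of such a table (the struct name and layout are just illustrative), the forward twiddles e^(-2*pi*i*k/n) for k = 0 .. n/2-1 can be computed once up front:

#include <cmath>
#include <vector>

/* Precompute the n/2 forward twiddle factors e^(-2*pi*i*k/n) once,
   instead of calling sin/cos inside the butterfly loops. */
struct TwiddleTable {
    std::vector<double> re, im;
    explicit TwiddleTable(long n) : re(n / 2), im(n / 2) {
        const double pi = 3.14159265358979323846;
        for (long k = 0; k < n / 2; ++k) {
            const double angle = -2.0 * pi * (double)k / (double)n;
            re[k] = std::cos(angle);
            im[k] = std::sin(angle);
        }
    }
};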

Finally, once you have your super-streamlined FFT implementation, you can utilise SIMD vectorization to compute 4x single-precision or 2x double-precision floating-point operations per instruction inside the butterfly routine, for another 100-300% speed improvement. Taking all of the above together, you'd have yourself a pretty slick and fast FFT!
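
For example, here is a sketch of the classic SSE3 complex multiply, assuming an interleaved (re, im, re, im) layout so one 128-bit register holds two single-precision complex numbers:

#include <pmmintrin.h> /* SSE3 */

/* Multiply two pairs of complex floats, (re0,im0,re1,im1) per register. */
static inline __m128 complexMul(__m128 a, __m128 b)
{
    __m128 bre = _mm_moveldup_ps(b);         /* (b0.re, b0.re, b1.re, b1.re) */
    __m128 bim = _mm_movehdup_ps(b);         /* (b0.im, b0.im, b1.im, b1.im) */
    __m128 t   = _mm_mul_ps(a, bre);         /* a * b.re                     */
    __m128 ash = _mm_shuffle_ps(a, a, 0xB1); /* swap re/im within each pair  */
    __m128 u   = _mm_mul_ps(ash, bim);       /* a(swapped) * b.im            */
    return _mm_addsub_ps(t, u);     /* (re*re - im*im, im*re + re*im, ...)   */
}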

To go further, you can perform optimisation on the fly by providing different implementations of the FFT stages targeted at specific processor architectures. Cache size, register count, SSE/SSE2/3/4 instruction sets etc. differ per machine, so a one-size-fits-all approach is often beaten by targeted routines. In FFTW, for instance, many smaller-size FFTs are highly optimised, unrolled (loop-free) implementations targeted at a specific architecture. By combining these smaller constructs (such as radix-N routines) you can choose the fastest and best routine for the task at hand.
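
A minimal sketch of such on-the-fly selection (the stage routines are hypothetical stubs, and __builtin_cpu_supports is a GCC/Clang-specific builtin) might look like this:

/* Hypothetical stage implementations; real code would do the butterflies. */
static void stage_radix4_scalar(double *x, double *y, long n) { /* ... */ }
static void stage_radix4_sse3(double *x, double *y, long n) { /* ... */ }

typedef void (*StageFn)(double *x, double *y, long n);

/* Pick the best stage routine once at startup. */
static StageFn chooseStage(void)
{
#if defined(__GNUC__)
    if (__builtin_cpu_supports("sse3"))
        return stage_radix4_sse3;
#endif
    return stage_radix4_scalar;
}

int main(void)
{
    double x[8] = {0}, y[8] = {0};
    StageFn stage = chooseStage();
    stage(x, y, 8); /* would run one radix-4 stage over 8 points */
    return 0;
}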

answered Nov 13 '22 by Dr. Andrew Burnett-Thompson


While I can't give you a performance hint right now, I'd like to give some advice for your optimization that is too long for a comment:

  1. If you haven't done so already, write a number of correctness tests for your code right now. Simple tests like "do an FFT of this array and see whether the results match the ones I've provided" suffice, but before you optimize code, you need a firm and automated unit test that confirms your optimized code is still correct (a minimal sketch follows this list).
  2. Then profile your code to see where the actual bottleneck is. While I suspect the innermost loop, for (i=j;i<n;i+=l2) {, seeing is better than believing.
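
For example, a minimal correctness check, assuming the FFTcalc signature from the question (dir == 1 meaning a forward transform scaled by 1/n), could compare the output against a naive O(n^2) DFT:

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

/* Naive O(n^2) forward DFT with 1/n scaling, matching the dir == 1
   convention of the code in the question. Slow, but easy to trust. */
static void naiveDFT(const std::vector<double> &re, const std::vector<double> &im,
                     std::vector<double> &outRe, std::vector<double> &outIm)
{
    const long n = (long)re.size();
    const double twoPi = 2.0 * 3.14159265358979323846;
    for (long k = 0; k < n; ++k) {
        double sumRe = 0.0, sumIm = 0.0;
        for (long t = 0; t < n; ++t) {
            const double a = twoPi * (double)k * (double)t / (double)n;
            sumRe += re[t] * std::cos(a) + im[t] * std::sin(a);
            sumIm += im[t] * std::cos(a) - re[t] * std::sin(a);
        }
        outRe[k] = sumRe / n;
        outIm[k] = sumIm / n;
    }
}

int main()
{
    const long m = 3, n = 1L << m; /* 8-point test */
    std::vector<double> re(n), im(n, 0.0);
    for (long i = 0; i < n; ++i)
        re[i] = std::rand() / (double)RAND_MAX;

    std::vector<double> expRe(n), expIm(n);
    naiveDFT(re, im, expRe, expIm);

    std::vector<double> gotRe(re), gotIm(im); /* FFTcalc works in place */
    FFTMethod fft;
    fft.FFTcalc(1, m, &gotRe[0], &gotIm[0]);

    for (long k = 0; k < n; ++k) {
        if (std::fabs(gotRe[k] - expRe[k]) > 1e-9 ||
            std::fabs(gotIm[k] - expIm[k]) > 1e-9) {
            std::printf("mismatch at bin %ld\n", k);
            return 1;
        }
    }
    std::printf("FFT output matches naive DFT\n");
    return 0;
}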
answered Nov 13 '22 by thiton