We are considering porting an application from a dedicated digital signal processing chip to run on generic x86 hardware. The application does a lot of Fourier transforms, and from brief research, it appears that FFTs are fairly well suited to computation on a GPU rather than a CPU. For example, this page has some benchmarks with a Core 2 Quad and a GF 8800 GTX that show a 10-fold decrease in calculation time when using the GPU:
http://www.cv.nrao.edu/~pdemores/gpu/
However, in our product, size constraints restrict us to small form factors such as PC104 or Mini-ITX, and thus to rather limited embedded GPUs.
Is offloading computation to the GPU something that is only worth doing with meaty graphics cards on a proper PCIe bus, or would even embedded GPUs offer performance improvements?
Having developed FFT routines both on x86 hardware and on GPUs (prior to CUDA, on 7800 GTX hardware), I found from my own results that for smaller FFT sizes (below 2^13) the CPU was faster, and above those sizes the GPU was faster. For instance, a 2^16-point FFT computed 2-4x more quickly on the GPU than the equivalent transform on the CPU. See the table of times below (all times are in seconds, comparing a 3 GHz Pentium 4 against a 7800 GTX; this work was done back in 2005, so it is old hardware and, as I said, non-CUDA. Newer libraries may show larger improvements).
N        FFTw (s)  GPUFFT (s)  GPUFFT MFLOPS  GPUFFT Speedup
8        0         0.00006     3.352705       0.006881
16       0.000001  0.000065    7.882117       0.010217
32       0.000001  0.000075    17.10887       0.014695
64       0.000002  0.000085    36.080118      0.026744
128      0.000004  0.000093    76.724324      0.040122
256      0.000007  0.000107    153.739856     0.066754
512      0.000015  0.000115    320.200892     0.134614
1024     0.000034  0.000125    657.735381     0.270512
2048     0.000076  0.000156    1155.151507    0.484331
4096     0.000173  0.000215    1834.212989    0.804558
8192     0.000483  0.00032     2664.042421    1.510011
16384    0.001363  0.000605    3035.4551      2.255411
32768    0.003168  0.00114     3450.455808    2.780041
65536    0.008694  0.002464    3404.628083    3.528726
131072   0.015363  0.005027    3545.850483    3.05604
262144   0.033223  0.012513    3016.885246    2.655183
524288   0.072918  0.025879    3079.443664    2.817667
1048576  0.173043  0.076537    2192.056517    2.260904
2097152  0.331553  0.157427    2238.01491     2.106081
4194304  0.801544  0.430518    1715.573229    1.861814
As suggested by other posters, the transfer of data to and from the GPU is the hit you take. Smaller FFTs can be performed on the CPU, with some implementations/sizes fitting entirely in cache, which makes the CPU the best choice for small FFTs (below ~1024 points). If, on the other hand, you need to perform large batches of work on the data with minimal moves to and from the GPU, then the GPU will beat the CPU hands down (see the batched cuFFT sketch further down).
I would suggest using FFTW if you want a fast FFT implementation on the CPU, or the Intel Math Kernel Library if you want an even faster (commercial) implementation. For FFTW, creating plans with the FFTW_MEASURE flag will time and test candidate FFT routines and select the fastest one for your specific hardware. I go into detail about this in this question.
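As a minimal sketch of that planning pattern (the transform size 4096 is just an illustrative choice, and error handling is omitted; link with -lfftw3):

```c
#include <fftw3.h>

int main(void)
{
    const int n = 4096;  /* transform size (illustrative) */

    /* fftw_malloc returns buffers aligned for FFTW's SIMD codelets. */
    fftw_complex *in  = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);

    /* FFTW_MEASURE actually runs and times candidate algorithms on this
       machine and keeps the fastest. It also overwrites the buffers, so
       fill `in` only after planning. Plan once, then reuse the plan. */
    fftw_plan plan = fftw_plan_dft_1d(n, in, out,
                                      FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill `in` with samples ... */
    fftw_execute(plan);  /* call repeatedly as new frames arrive */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```

Planning with FFTW_MEASURE can take noticeably longer than FFTW_ESTIMATE, so do it once at startup rather than per transform.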
For GPU implementations you can't do better than cuFFT, the FFT library NVIDIA provides with CUDA. GPU performance has increased significantly since I did my experiments on a 7800 GTX, so I would suggest giving their SDK a go for your specific requirement.
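For illustration, here is a minimal batched cuFFT sketch; the transform size (4096) and batch count (256) are arbitrary assumptions, and error checking is omitted. Batching many transforms per copy amortises the PCIe transfer cost mentioned above:

```c
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdlib.h>

int main(void)
{
    const int n = 4096;     /* points per transform (illustrative)  */
    const int batch = 256;  /* transforms per launch (illustrative) */
    size_t bytes = sizeof(cufftComplex) * n * batch;

    cufftComplex *host = (cufftComplex *)malloc(bytes);
    cufftComplex *dev;
    cudaMalloc((void **)&dev, bytes);

    /* One plan, reused for the whole batch of transforms. */
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    /* ... fill `host` with samples ... */

    /* The two copies below are the transfer hit discussed above. */
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);  /* in-place FFT */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(dev);
    free(host);
    return 0;
}
```

Timing the two cudaMemcpy calls separately from cufftExecC2C on your actual embedded GPU will show how much of the total cost is the transfer rather than the FFT itself.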