I'm experimenting with using cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first doing an explicit conversion of the input buffer to float
). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs might overlap by 1/4 to 1/8 of the FFT length.
cuFFT, with its FFTW-like interface, explicitly supports this via the idist
parameter of the cufftPlanMany()
function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I would set idist = 32768 - 4096
. This does work properly in the sense that it yields the correct output.
However, I'm seeing strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways:
Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap
as I described above. Install a load callback function that just does the conversion from int8_t
to float
as needed on the buffer index provided to the callback.
Don't tell cuFFT about the overlapping nature of the input; lie to it an dset idist = nfft
. Then, let the callback function handle the overlapping by calculating the correct index that should be read for each FFT input.
A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. I didn't reproduce it all here for brevity. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (with a Geforce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result:
executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec
Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:
Method 1:
template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index,
void *, void *)
{
return (cufftReal)(((const T *) inbuf)[fft_index]);
}
Method 2:
template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf,
size_t fft_index, void *, void *)
{
// fft_index is the index of the sample that we need, not taking
// the overlap into account. Convert it to the appropriate sample
// index, considering the overlap structure. First, grab the FFT
// parameters from constant memory.
int nfft = overlap_params.nfft;
int overlap = overlap_params.overlap;
// Calculate which FFT in the batch that we're reading data for. This
// tells us how much overlap we need to account for. Just use integer
// arithmetic here for speed, knowing that this would cause a problem
// if we did a batch larger than 2Gsamples long.
int fft_index_int = fft_index;
int fft_batch_index = fft_index_int / nfft;
// For each transform past the first one, we need to slide "overlap"
// samples back in the input buffer when fetching the sample.
fft_index_int -= fft_batch_index * overlap;
// Cast the input pointer to the appropriate type and convert to a float.
return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
Method 2 has a significantly more complex callback function, one that even involves integer division by a non-compile time value! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing much differently when the input overlaps, thus resulting in the degraded performance?
It seems like I should be able to achieve performance that is quite a bit faster than method 2 if the index calculations could be removed from the callback (but that would require the overlapping to be specified to cuFFT).
Edit: After running my test program under nvvp
, I can see that cuFFT definitely seems to be structuring its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this:
Method 1:
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE
: 3.72 msecspRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>
: 7.71 msecspRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>
: 12.75 msec (yes, it gets invoked twice)__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_
: 7.49 msecMethod 2:
spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>
: 5.15 msecspRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>
: 12.88 msec__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_
: 7.51 msecInterestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs using method 1 (when cuFFT knows about the overlapping), but with method 2 (where it doesn't know that the FFTs are overlapped), it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters between methods 1 and 2.
I don't see why it should have to use a different implementation here, especially since the input stride istride == 1
. It should just use a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.
Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles appropriately, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results:
executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec
So the performance seems to change depending upon whether there are other cuFFT plans in existence when creating a plan for the test case! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels just all seem to execute faster. I have no reasonable explanation for this effect either.
If you specify non-standard strides (doesn't matter if batch/transform) cuFFT uses different path internally.
ad edit 2: This is likely GPU Boost adjusting clocks on GPU. cuFFT plan do not have impact one on another
Ways to get more stable results:
At the suggestion of @llukas, I filed a bug report with NVIDIA regarding the issue (https://partners.nvidia.com/bug/viewbug/1821802 if you're registered as a developer). They acknowledged the poorer performance with overlapped plans. They actually indicated that the kernel configuration used in both cases is suboptimal and they plan to improve that eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0, there is no workaround to make cuFFT use a more efficient method with strided inputs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With