I have a bit of C++ code that has become a somewhat useful FFT library over time, and it has been made to run decently fast using SSE and AVX instructions. Granted, it's all only based on a radix-2 algorithm, but it still holds up. My latest itch to scratch is making the butterfly calculations work with FMA instructions. The basic radix-2 butterfly consists of 4 multiplies, and 6 additions or subtractions. A simple approach would involve replacing 2 of the additions and subtractions and 2 multiplies with 2 FMA instructions, resulting in a mathematically identical butterfly, but there are apparently better ways of doing this:
https://books.google.com/books?id=2HG0DwAAQBAJ&pg=PA56&lpg=PA56&dq=radix+2+fft+fma&source=bl&ots=R5XDWyYBVv&sig=ACfU3U0S2n1hcgiP63LTKMxI5Oc85eEZaQ&hl=en&sa=X&ved=2ahUKEwiz_I3PsrToAhVoHzQIHYmVDGIQ6AEwDXoECAoQAQ#v=onepage&q=radix%202%20fft%20fma&f=false
ci1 = ci1 / cr1
u0 = zinr(0)
v0 = zini(0)
r = zinr(1)
s = sini(1)
u1 = r - s * ci1
v1 = r * ci1 + s
zoutr(0) = u0 + u1 * cr1
zouti(0) = v0 + v1 * cr1
zoutr(1) = u0 - u1 * cr1
zouti(1) = v0 - v1 * cr1
The author replaces all 10 adds, subs, and mults with 6 FMA's, provided that the imaginary part of the twiddle factor is divided by the real part. Part of the text reads "Note that cr1 != 0". Which is essentially my problem in a nutshell. The math seems to work just as advertised for all twiddle factors except when the real twiddle is zero, in which case, we end up dividing by zero. Where efficiency is absolutely critical here, branching code when cr1 == 0 to a different butterfly isn't a good option, especially when we're using SIMD to process multiple twiddles and butterflies at once, where perhaps only one element of cr1 == 0. What my gut is telling me should be the case, is that when cr1 == 0, cr1 and ci1 should be some other values entirely and the FMA code will still result in the correct answer, but I cannot seem to figure this out. If I could figure it out, it would be a relatively straightforward thing to modify the precomputed twiddle factors for FMA butterflies and we also could, of course, avoid the division operation at the start of the butterfly.
The book seems to suggest that cr1 != 0
is always true. But unfortunately, it is not always the case (when the rotation angle is PI/2).
I don't think that you can solve this by adjusting the twiddle factors. The only option I see is to use some very small number instead of zero. It could work, but it's ugly, and it may cause inaccuracies in certain cases.
Possible solutions:
cr1
, divide by ci1
, and modify the forumula accordingly. This case still has a divison by zero, but it will happen at the first iteration of the loop. So instead of the center, you have to handle the first iteration specially (so only one loop is needed).Notice, that:
zoutr(1) = u0 - u1
= u0 - u1 - (u0 + u1) + (u0 + u1)
= u0 - u1 - zoutr(0) + u0 + u1
= 2*u0 - zoutr(0)
So, this operation can be done in 1 FMA.
And if you substitute u1
into the expression of zoutr(0)
:
zoutr(0) = u0 + u1
= u0 + r*cr1 - s*ci1
This can be done with 2 FMAs.
Calculating zouti
can be done in the same manner as zoutr
. So this way you need to use 6 FMA operations, which is the same amount of operations that the book has.
(Note, this doesn't mean that this variant will run faster automatically, as it has a different data dependency chain)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With