I have to compute many Fourier transforms. I would like to do these in parallel with my many cores. Note that I don't want a parallel FFT algorithm, I just want to launch many embarrassingly parallel FFTs.
I discover that, while my CPU usage goes up, my time to completion does not decrease.
We create some random data
In [1]: import numpy as np
In [2]: x = np.random.random(10000000) # some random data
And time how long it takes to compute an FFT both cold and after having computed it once.
In [3]: %time _ = np.fft.rfft(x) # cost of one run
CPU times: user 589 ms, sys: 23.9 ms, total: 612 ms
Wall time: 613 ms
In [4]: %time _ = np.fft.rfft(x) # there is some speedup from mulitple runs
CPU times: user 365 ms, sys: 12.4 ms, total: 378 ms
Wall time: 381 ms
We run this on a sequence of data sequentially
In [5]: %time _ = map(np.fft.rfft, [x] * 12) # many runs sequentially
CPU times: user 4.4 s, sys: 135 ms, total: 4.54 s
Wall time: 4.54 s
In [6]: 4.54 / 12 # Same cost per FFT
Out[6]: 0.37833333333333335
We do the same thing but now using a thread pool of four threads.
In [7]: from multiprocessing.pool import ThreadPool
In [8]: pool = ThreadPool(4) # I have four physical cores
In [9]: %time _ = pool.map(np.fft.rfft, [x] * 12)
CPU times: user 15.5 s, sys: 1.3 s, total: 16.8 s
Wall time: 4.79 s
We find that there is no speedup. However we do find that CPU usage, as measured by top
, is near 400%. This isn't an issue with the GIL. There is something about FFT that doesn't parallelize well. Perhaps we're thrashing higher level caches?
Hardware: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
Generally what's going on here and is there a way to leverage multiple cores to speed up multiple FFTs in parallel?
On my workstation, ThreadPool does provide a speed-up (though not a perfect one):
In [42]: x = np.random.random(2**23)
In [43]: %time _ = list(map(np.fft.rfft, [x]*12))
CPU times: user 3.32 s, sys: 380 ms, total: 3.7 s
Wall time: 3.7 s
In [44]: tpool = ThreadPool(4)
In [45]: %time _ = list(tpool.map(np.fft.rfft, [x]*12))
CPU times: user 5.4 s, sys: 596 ms, total: 6 s
Wall time: 1.62 s
In [46]: 3.7/4
Out[46]: 0.925
I'm using Python3, so maybe there is something there? Otherwise, its likely hardware. FFTs are memory-bound, so its quite possible that a single thread is saturating your memory system. You might be able to get better memory system locality by dropping down to an environment that let's you control affinity.
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With