Increased occupancy without speedup in FFT

Problem

I have to compute many Fourier transforms. I would like to do these in parallel with my many cores. Note that I don't want a parallel FFT algorithm, I just want to launch many embarrassingly parallel FFTs.

I discover that, while my CPU usage goes up, my time to completion does not decrease.

Example

We create some random data

In [1]: import numpy as np

In [2]: x = np.random.random(10000000)  # some random data

And time how long it takes to compute an FFT both cold and after having computed it once.

In [3]: %time _ = np.fft.rfft(x)        # cost of one run
CPU times: user 589 ms, sys: 23.9 ms, total: 612 ms
Wall time: 613 ms

In [4]: %time _ = np.fft.rfft(x)        # there is some speedup from multiple runs
CPU times: user 365 ms, sys: 12.4 ms, total: 378 ms
Wall time: 381 ms

We then run this over a list of inputs sequentially.

In [5]: %time _ = map(np.fft.rfft, [x] * 12)  # many runs sequentially
CPU times: user 4.4 s, sys: 135 ms, total: 4.54 s
Wall time: 4.54 s

In [6]: 4.54 / 12                       # Same cost per FFT
Out[6]: 0.37833333333333335
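(Note if you are reproducing this on Python 3: map is lazy there and returns an iterator, so In [5] would measure almost nothing; wrap it in list to force the work, as the answer below does.)

%time _ = list(map(np.fft.rfft, [x] * 12))   # list() forces the lazy map so the FFTs actually run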

We do the same thing but now using a thread pool of four threads.

In [7]: from multiprocessing.pool import ThreadPool

In [8]: pool = ThreadPool(4)            # I have four physical cores

In [9]: %time _ = pool.map(np.fft.rfft, [x] * 12)
CPU times: user 15.5 s, sys: 1.3 s, total: 16.8 s
Wall time: 4.79 s

We find that there is no speedup, even though CPU usage, as measured by top, sits near 400%, so this doesn't look like a GIL problem. Something about the FFT itself isn't parallelizing well. Perhaps we're thrashing the higher-level caches?
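As a quick sanity check (a rough sketch, not something I've benchmarked; rfft_of_random is just an illustrative helper), a process-based pool sidesteps the GIL entirely, and having each worker generate its own data avoids pickling the 80 MB array for every task:

import numpy as np
from multiprocessing import Pool

def rfft_of_random(n):
    # generate the data inside the worker so nothing large is pickled to or from it
    return np.fft.rfft(np.random.random(n)).shape

if __name__ == '__main__':
    pool = Pool(4)                                    # four worker processes, no shared GIL
    print(pool.map(rfft_of_random, [10000000] * 12))  # 12 independent FFTs
    pool.close()
    pool.join()

If this also shows no wall-time improvement, memory bandwidth or shared caches are the more likely limit.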

Hardware: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz

Question

Generally, what's going on here, and is there a way to leverage multiple cores to speed up many independent FFTs?

asked by MRocklin


1 Answer

On my workstation, ThreadPool does provide a speed-up (though not a perfect one):

In [42]: x = np.random.random(2**23)

In [43]: %time _ = list(map(np.fft.rfft, [x]*12))
CPU times: user 3.32 s, sys: 380 ms, total: 3.7 s
Wall time: 3.7 s

In [44]: tpool = ThreadPool(4)

In [45]: %time _ = list(tpool.map(np.fft.rfft, [x]*12))
CPU times: user 5.4 s, sys: 596 ms, total: 6 s
Wall time: 1.62 s

In [46]: 3.7/4                          # ideal wall time for 4 threads; 1.62 s measured
Out[46]: 0.925

I'm using Python 3 (hence the explicit list() around map, which is lazy there), so maybe there is something in that? Otherwise, it's likely hardware. FFTs are memory-bound, so it's quite possible that a single thread is already saturating your memory system. You might be able to get better memory locality by dropping down to an environment that lets you control CPU affinity.
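For example, here is a minimal sketch of that idea (my own illustration, Linux-only since it relies on os.sched_setaffinity; pinned_rfft is a hypothetical helper): each task pins its worker process to one core, so an FFT doesn't migrate between cores mid-run.

import os
import numpy as np
from multiprocessing import Pool

def pinned_rfft(args):
    core, n = args
    os.sched_setaffinity(0, {core})    # pin the current worker process (pid 0 = self) to one core
    return np.fft.rfft(np.random.random(n)).shape

if __name__ == '__main__':
    work = [(core, 2 ** 23) for core in range(4) for _ in range(3)]  # 12 FFTs, 3 tagged per core
    pool = Pool(4)
    print(pool.map(pinned_rfft, work))
    pool.close()
    pool.join()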

Hardware

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz.

answered by Max Hutchinson