Running NumPy version 1.19.2, I get better performance by taking the mean over each axis of an array in turn than by calculating the mean over an already flattened array.
shape = (10000,32,32,3)
mat = np.random.random(shape)
# Call this Method A.
%%timeit
mat_means = mat.mean(axis=0).mean(axis=0).mean(axis=0)
14.6 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mat_reshaped = mat.reshape(-1,3)
# Call this Method B.
%%timeit
mat_means = mat_reshaped.mean(axis=0)
135 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is odd, since taking the mean multiple times has the same bad access pattern as the mean over the reshaped array (perhaps an even worse one), and we do more operations this way. As a sanity check, I converted the array to Fortran order:
mat_reshaped_fortran = mat.reshape(-1,3, order='F')
%%timeit
mat_means = mat_reshaped_fortran.mean(axis=0)
12.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This yields the performance improvement I expected.
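For what it's worth, the strides of the two reshaped arrays show the access-pattern difference (a quick check; the names mirror the ones above, and the arrays are float64, so 8 bytes per element):
import numpy as np
mat = np.random.random((10000, 32, 32, 3))
mat_reshaped = mat.reshape(-1, 3)                     # a view, still C order
mat_reshaped_fortran = mat.reshape(-1, 3, order='F')  # forces a copy in Fortran order
# In C order, consecutive elements of one column are 24 bytes (3 doubles) apart;
# in the Fortran-order copy, each column is contiguous (8 bytes apart).
print(mat_reshaped.strides)          # (24, 8)
print(mat_reshaped_fortran.strides)  # (8, 81920000)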
For Method A, %prun gives:
36 function calls in 0.019 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.018 0.006 0.018 0.006 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.019 0.019 {built-in method builtins.exec}
3 0.000 0.000 0.019 0.006 _methods.py:143(_mean)
3 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
3 0.000 0.000 0.019 0.006 {method 'mean' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
3 0.000 0.000 0.000 0.000 {built-in method numpy.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
6 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
6 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
While for Method B:
14 function calls in 0.166 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.166 0.166 0.166 0.166 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.166 0.166 {built-in method builtins.exec}
1 0.000 0.000 0.166 0.166 _methods.py:143(_mean)
1 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.166 0.166 <string>:1(<module>)
1 0.000 0.000 0.166 0.166 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
1 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Note: np.setbufsize(1e7) doesn't seem to have any effect.
What is the reason for this performance difference?
The reason why NumPy is fast when used right is that its arrays are extremely efficient. They are like C arrays instead of Python lists.
Let's call your original matrix mat. mat.shape = (10000,32,32,3). Visually, this is like having a "stack" of 10,000 32x32x3 rectangular prisms (I think of them as LEGOs) of floats.
Now let's think about what you did in terms of floating point operations (flops):
In Method A, you do mat.mean(axis=0).mean(axis=0).mean(axis=0). Let's break this down, counting one addition per element summed plus one division per mean:
- The first mean(axis=0) averages the stack of 10,000 prisms into a single 32x32x3 prism: 10,000 additions and 1 division for each of the 32*32*3 = 3,072 entries, i.e. 3,072 * 10,001 = 30,723,072 flops.
- The second mean(axis=0) reduces that 32x32x3 prism to a 32x3 sheet: 32 additions and 1 division for each of the 32*3 = 96 entries, i.e. 96 * 33 = 3,168 flops.
- The third mean(axis=0) reduces the 32x3 sheet to a stub of 3 means: 3 * 33 = 99 flops.
The grand total of all this is 30,723,072 + 3,168 + 99 = 30,726,339 flops.
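If you want to double-check that tally, it's quick arithmetic in plain Python:
step1 = 32 * 32 * 3 * (10000 + 1)  # first mean:  30,723,072 flops
step2 = 32 * 3 * (32 + 1)          # second mean:      3,168 flops
step3 = 3 * (32 + 1)               # third mean:          99 flops
print(step1 + step2 + step3)       # 30,726,339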
In Method B, you do mat_reshaped = mat.reshape(-1,3); mat_means = mat_reshaped.mean(axis=0). Let's break this down:
mat_reshaped is a long roll of paper of size 10,240,000x3. Taking the mean of each of the 3 columns gives you a stub with 3 means written on it. You have performed 10,240,000 additions and 1 division per mean, of which there are 3. In total, you've done 3 * 10,240,001 = 30,720,003 flops.
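The same check for Method B, which also shows just how close the two totals are:
method_a = 32 * 32 * 3 * 10001 + 32 * 3 * 33 + 3 * 33  # 30,726,339 flops
method_b = 3 * (10_240_000 + 1)                        # 30,720,003 flops
print(method_a / method_b)                             # ~1.0002: nearly identical work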
So now you're saying to yourself, "What! All of that work, only to show that the slower method actually does less work?!" Here's the problem: although Method B has less work to do, it does not have a lot less work to do, meaning that from a flop standpoint alone we would expect the two methods to have similar runtimes.
You also have to consider the size of your reshaped array in Method B: a matrix with 10,240,000 rows is huge! It's really hard/inefficient for the computer to access all of that, and more memory accesses mean longer runtimes. The fact is that in its original 10,000x32x32x3 shape, the matrix was already partitioned into convenient slices that the computer could access more efficiently. This is actually a common technique when handling giant matrices: see Jaime's response to a similar question, or this article; both talk about how breaking up a big matrix into smaller slices helps your program be more memory efficient, and therefore makes it run faster.
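If you want to play with that slicing idea yourself, here is a minimal sketch assuming a hypothetical chunk size of 32*32 rows; it illustrates the blocking technique, it is not a tuned benchmark:
import numpy as np

def chunked_column_means(a, chunk=32 * 32):
    # Accumulate column sums over small blocks of rows,
    # then divide once at the end to get the column means.
    total = np.zeros(a.shape[1])
    for start in range(0, a.shape[0], chunk):
        total += a[start:start + chunk].sum(axis=0)
    return total / a.shape[0]

mat = np.random.random((10000, 32, 32, 3))
print(chunked_column_means(mat.reshape(-1, 3)))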