Running NumPy version 1.19.2, I get better performance by taking the mean over each axis of an array in turn than by calculating the mean over an already flattened array.
shape = (10000,32,32,3)
mat = np.random.random(shape)
# Call this Method A.
%%timeit
mat_means = mat.mean(axis=0).mean(axis=0).mean(axis=0)
14.6 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mat_reshaped = mat.reshape(-1,3)
# Call this Method B.
%%timeit
mat_means = mat_reshaped.mean(axis=0)
135 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is odd, since taking the mean multiple times has the same bad access pattern as the mean over the reshaped array (perhaps an even worse one), and we do more operations this way. As a sanity check, I converted the array to Fortran order:
mat_reshaped_fortran = mat.reshape(-1,3, order='F')
%%timeit
mat_means = mat_reshaped_fortran.mean(axis=0)
12.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This yields the performance improvement I expected.
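For what it's worth, the strides of the two reshaped arrays show the access-pattern difference (a quick check; the names mirror the ones above, and the arrays are float64, so 8 bytes per element):
import numpy as np
mat = np.random.random((10000, 32, 32, 3))
mat_reshaped = mat.reshape(-1, 3)                     # a view, still C order
mat_reshaped_fortran = mat.reshape(-1, 3, order='F')  # forces a copy in Fortran order
# In C order, consecutive elements of one column are 24 bytes (3 doubles) apart;
# in the Fortran-order copy, each column is contiguous (8 bytes apart).
print(mat_reshaped.strides)          # (24, 8)
print(mat_reshaped_fortran.strides)  # (8, 81920000)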
For Method A, %prun gives:
36 function calls in 0.019 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.018 0.006 0.018 0.006 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.019 0.019 {built-in method builtins.exec}
3 0.000 0.000 0.019 0.006 _methods.py:143(_mean)
3 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
3 0.000 0.000 0.019 0.006 {method 'mean' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
3 0.000 0.000 0.000 0.000 {built-in method numpy.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
6 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
6 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
While for Method B:
14 function calls in 0.166 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.166 0.166 0.166 0.166 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.166 0.166 {built-in method builtins.exec}
1 0.000 0.000 0.166 0.166 _methods.py:143(_mean)
1 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.166 0.166 <string>:1(<module>)
1 0.000 0.000 0.166 0.166 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
1 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Note: np.setbufsize(1e7) doesn't seem to have any effect.
What is the reason for this performance difference?
The reason why NumPy is fast when used right is that its arrays are extremely efficient. They are like C arrays instead of Python lists.
Let's call your original matrix mat. mat.shape = (10000,32,32,3). Visually, this is like having a "stack" of 10,000 32x32x3 rectangular prisms (I think of them as LEGOs) of floats.
Now let's think about what you did in terms of floating point operations (flops):
In Method A, you do mat.mean(axis=0).mean(axis=0).mean(axis=0). Let's break this down, counting one addition per element summed plus one division per mean:
- The first mean(axis=0) averages the stack of 10,000 prisms into a single 32x32x3 prism: 10,000 additions and 1 division for each of the 32*32*3 = 3,072 entries, i.e. 3,072 * 10,001 = 30,723,072 flops.
- The second mean(axis=0) reduces that 32x32x3 prism to a 32x3 sheet: 32 additions and 1 division for each of the 32*3 = 96 entries, i.e. 96 * 33 = 3,168 flops.
- The third mean(axis=0) reduces the 32x3 sheet to a stub of 3 means: 3 * 33 = 99 flops.
The grand total of all this is 30,723,072 + 3,168 + 99 = 30,726,339 flops.
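If you want to double-check that tally, it's quick arithmetic in plain Python:
step1 = 32 * 32 * 3 * (10000 + 1)  # first mean:  30,723,072 flops
step2 = 32 * 3 * (32 + 1)          # second mean:      3,168 flops
step3 = 3 * (32 + 1)               # third mean:          99 flops
print(step1 + step2 + step3)       # 30,726,339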
In Method B, you do mat_reshaped = mat.reshape(-1,3); mat_means = mat_reshaped.mean(axis=0). Let's break this down:
mat_reshaped is a long roll of paper of size 10,240,000x3. Taking the mean of each of the 3 columns gives you a stub with 3 means written on it. You have performed 10,240,000 additions and 1 division per mean, of which there are 3. In total, you've done 3 * 10,240,001 = 30,720,003 flops.
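The same check for Method B, which also shows just how close the two totals are:
method_a = 32 * 32 * 3 * 10001 + 32 * 3 * 33 + 3 * 33  # 30,726,339 flops
method_b = 3 * (10_240_000 + 1)                        # 30,720,003 flops
print(method_a / method_b)                             # ~1.0002: nearly identical work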
So now you're saying to yourself, "What! All of that work, only to show that the slower method actually does less work?!" Here's the problem: although Method B has less work to do, it does not have a lot less work to do, meaning that from a flop standpoint alone we would expect the two methods to have similar runtimes.
You also have to consider the size of your reshaped array in Method B: a matrix with 10,240,000 rows is huge! It's really hard/inefficient for the computer to access all of that, and more memory accesses mean longer runtimes. The fact is that in its original 10,000x32x32x3 shape, the matrix was already partitioned into convenient slices that the computer could access more efficiently. This is actually a common technique when handling giant matrices: see Jaime's response to a similar question, or this article; both talk about how breaking up a big matrix into smaller slices helps your program be more memory efficient, and therefore makes it run faster.
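If you want to play with that slicing idea yourself, here is a minimal sketch assuming a hypothetical chunk size of 32*32 rows; it illustrates the blocking technique, it is not a tuned benchmark:
import numpy as np

def chunked_column_means(a, chunk=32 * 32):
    # Accumulate column sums over small blocks of rows,
    # then divide once at the end to get the column means.
    total = np.zeros(a.shape[1])
    for start in range(0, a.shape[0], chunk):
        total += a[start:start + chunk].sum(axis=0)
    return total / a.shape[0]

mat = np.random.random((10000, 32, 32, 3))
print(chunked_column_means(mat.reshape(-1, 3)))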