Below, I compared the performance of sum operations on C-contiguous and Fortran-contiguous arrays (C vs. Fortran memory order). I set axis=0
to ensure numbers are added up column-wise. I was surprised that the Fortran-contiguous array is actually slower than its C counterpart. Isn't a Fortran-contiguous array supposed to store its columns contiguously in memory, and hence be better at column-wise operations?
import numpy as np
a = np.random.standard_normal((10000, 10000))
c = np.array(a, order='C')
f = np.array(a, order='F')
In a Jupyter notebook, run
%timeit c.sum(axis=0)
10 loops, best of 3: 84.6 ms per loop
%timeit f.sum(axis=0)
10 loops, best of 3: 137 ms per loop
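For reference, the flags attribute confirms the two layouts; a quick sanity check (using a smaller shape than above, just for convenience):

```python
import numpy as np

a = np.random.standard_normal((1000, 1000))  # smaller array, layout check only
c = np.array(a, order='C')
f = np.array(a, order='F')

# A 2-D copy in each order sets exactly one contiguity flag
print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])   # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])   # False True
```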
I think it's in the implementation of the np.sum(). For example:
import numpy as np
A = np.random.standard_normal((10000,10000))
C = np.array(A, order='C')
F = np.array(A, order='F')
Benchmarking with IPython:
In [7]: %timeit C.sum(axis=0)
10 loops, best of 3: 101 ms per loop
In [8]: %timeit C.sum(axis=1)
10 loops, best of 3: 149 ms per loop
In [9]: %timeit F.sum(axis=0)
10 loops, best of 3: 149 ms per loop
In [10]: %timeit F.sum(axis=1)
10 loops, best of 3: 102 ms per loop
So it's behaving exactly the opposite as expected. But let's try out some other function:
In [17]: %timeit np.amax(C, axis=0)
1 loop, best of 3: 173 ms per loop
In [18]: %timeit np.amax(C, axis=1)
10 loops, best of 3: 70.4 ms per loop
In [13]: %timeit np.amax(F,axis=0)
10 loops, best of 3: 72 ms per loop
In [14]: %timeit np.amax(F,axis=1)
10 loops, best of 3: 168 ms per loop
Sure, it's apples to oranges. But np.amax() works along an axis just as sum does, returning a vector with one element per row/column, and it behaves as one would expect.
In [25]: C.strides
Out[25]: (80000, 8)
In [26]: F.strides
Out[26]: (8, 80000)
This tells us that the arrays are in fact packed in row order and column order respectively, and looping in that direction should be a lot faster. Unless, for example, sum builds the column sums (axis=0) by accumulating the array row by row. But without a means of peeking inside the .pyd I'm just speculating.
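The stride arithmetic can be checked directly. A small sketch (shape chosen just for illustration) showing that the byte offset of element [i, j] is i*strides[0] + j*strides[1]:

```python
import numpy as np

A = np.arange(12, dtype=np.float64).reshape(3, 4)
C = np.array(A, order='C')
F = np.array(A, order='F')

# strides = bytes to step to the next element along each axis
print(C.strides)  # (32, 8): a row is one contiguous 4*8-byte block
print(F.strides)  # (8, 24): a column is one contiguous 3*8-byte block

# Byte offset of element [i, j] is i*strides[0] + j*strides[1];
# dividing by itemsize gives its position in the raw memory buffer.
i, j = 1, 2
flat_c = C.ravel(order='K')  # memory-order view of C
flat_f = F.ravel(order='K')  # memory-order view of F
assert flat_c[(i * C.strides[0] + j * C.strides[1]) // C.itemsize] == C[i, j]
assert flat_f[(i * F.strides[0] + j * F.strides[1]) // F.itemsize] == F[i, j]
```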
EDIT:
From percusse's link: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html
Reduces a's dimension by one, by applying ufunc along one axis.
Let a.shape = (N_0, ..., N_i, ..., N_{M-1}). Then ufunc.reduce(a, axis=i)[k_0, ..,k_{i-1}, k_{i+1}, .., k_{M-1}] = the result of iterating j over range(N_i), cumulatively applying ufunc to each a[k_0, ..,k_{i-1}, j, k_{i+1}, .., k_{M-1}]
So in pseudocode, when calling F.sum(axis=0):
for i = 0 .. rows-1      # iterate along axis=0, one whole row per step
    for j = 0 .. cols-1  # accumulate into the per-column running sums
        out(j) = out(j) + F(i, j)
So it would actually iterate over each row when calculating the column sum, which slows things down considerably when the data is in column-major order. Behaviour such as this would explain the difference.
eric's link provides us with the implementation, for anybody curious enough to dig through a large amount of code for the reason.
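This reading of reduce can be sanity-checked from Python; the following is a rough sketch of the semantics, not the actual C loop:

```python
import numpy as np

F = np.asfortranarray(np.random.standard_normal((1000, 500)))

# sum along an axis is a reduction of the add ufunc
assert np.allclose(F.sum(axis=0), np.add.reduce(F, axis=0))

# Mimic the quoted reduce semantics for axis=0: iterate over rows,
# cumulatively adding each whole row into the running result.
acc = F[0].copy()
for j in range(1, F.shape[0]):
    acc += F[j]   # reads one full row per step -- against F's memory order
assert np.allclose(acc, F.sum(axis=0))
```

Each loop step touches a full row, which is the slow direction for a Fortran-ordered array.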
That's expected. If you check the result of
%timeit f.sum(axis=1)
it gives a timing similar to that of c.sum(axis=0). Similarly,
%timeit c.sum(axis=1)
is slower.
Some explanation: suppose you have the following structure
|1| |6|
|2| |7|
|3| |8|
|4| |9|
|5| |10|
As Eric mentioned, these operations work via reduce. Let's say we are asking for the column sum. The intuitive mechanism, in which each column is accessed once, summed, and recorded, is not at play. In fact it is the opposite: each row is accessed in turn and the function (here, summing) is applied, in essence similar to having two arrays a, b
and executing
a += b
That's a very informal way of repeating what the documentation of reduce states rather cryptically. This requires rows to be accessed contiguously even though we are performing a column sum: [1,6] + [2,7] + [3,8] + ... Hence, the direction of traversal is determined by the implementation of the operation, not by the array's layout.
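Using the 5x2 layout above, that mechanism can be sketched as repeated row additions (a toy illustration, not NumPy's internals):

```python
import numpy as np

# The 5x2 example from above, stored in Fortran order
F = np.array([[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]], order='F')

# Column sum as reduce performs it: accumulate row by row
acc = F[0].copy()
for row in F[1:]:
    acc += row            # [1,6] + [2,7] + [3,8] + ...
print(acc)                # [15 40]
assert (acc == F.sum(axis=0)).all()
```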