Performance between C-contiguous and Fortran-contiguous array operations

Below, I compared the performance of sum operations on C-contiguous and Fortran-contiguous arrays (C vs. Fortran memory order). I set axis=0 so that numbers are added up column-wise. I was surprised that the Fortran-contiguous array is actually slower than its C counterpart. Doesn't a Fortran-contiguous array store its columns contiguously in memory, and shouldn't it therefore be better at column-wise operations?

import numpy as np
a = np.random.standard_normal((10000, 10000))
c = np.array(a, order='C')
f = np.array(a, order='F')

In a Jupyter notebook, run:

%timeit c.sum(axis=0)
10 loops, best of 3: 84.6 ms per loop
%timeit f.sum(axis=0)
10 loops, best of 3: 137 ms per loop
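Before benchmarking, it can help to confirm what `order='C'` and `order='F'` actually produced. A minimal sketch (small sizes chosen here just for illustration) inspecting the layout via `.flags`:

```python
import numpy as np

# Build the same array in both memory orders and inspect the layout flags.
a = np.random.standard_normal((1000, 1000))
c = np.array(a, order='C')
f = np.array(a, order='F')

print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True

# Both orders hold identical values, so the column sums agree;
# only the timing differs.
assert np.allclose(c.sum(axis=0), f.sum(axis=0))
```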
Asked by Nicholas, Aug 20 '16



2 Answers

I think it comes down to the implementation of np.sum(). For example:

import numpy as np

A = np.random.standard_normal((10000,10000))
C = np.array(A, order='C')
F = np.array(A, order='F')

Benchmarking with IPython:

In [7]: %timeit C.sum(axis=0)
10 loops, best of 3: 101 ms per loop

In [8]: %timeit C.sum(axis=1)
10 loops, best of 3: 149 ms per loop

In [9]: %timeit F.sum(axis=0)
10 loops, best of 3: 149 ms per loop

In [10]: %timeit F.sum(axis=1)
10 loops, best of 3: 102 ms per loop

So it's behaving exactly the opposite of what we expected. But let's try out another function:

In [17]: %timeit np.amax(C, axis=0)
1 loop, best of 3: 173 ms per loop

In [18]: %timeit np.amax(C, axis=1)
10 loops, best of 3: 70.4 ms per loop

In [13]: %timeit np.amax(F,axis=0)
10 loops, best of 3: 72 ms per loop

In [14]: %timeit np.amax(F,axis=1)
10 loops, best of 3: 168 ms per loop

Sure, it's apples to oranges. But np.amax() works along an axis just like sum does, returning a vector with one element per row/column, and it behaves as one would expect.

In [25]: C.strides
Out[25]: (80000, 8)

In [26]: F.strides
Out[26]: (8, 80000)

This tells us that the arrays are in fact packed in row order and column order respectively, and looping in the packed direction should be a lot faster. Unless, for example, sum adds row by row as it travels down the columns to produce the column sum (axis=0). But without a means of peeking inside the compiled extension, I'm just speculating.
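The strides tuple gives the byte step per axis, so for a 2-D array the flat byte offset of element [i, j] is i*strides[0] + j*strides[1]. A small sketch (using a tiny 3x4 array instead of the 10000x10000 one) verifying this against the raw buffer:

```python
import struct
import numpy as np

c = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order
f = np.asfortranarray(c)                           # F order

print(c.strides)  # (32, 8): one row is 4 * 8 bytes, one element 8 bytes
print(f.strides)  # (8, 24): consecutive elements of a column are adjacent

# Reading the raw C-order buffer at the computed offset recovers c[i, j].
i, j = 1, 2
offset = i * c.strides[0] + j * c.strides[1]
value = struct.unpack_from('d', c.tobytes(), offset)[0]
assert value == c[i, j]
```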

EDIT:

From percusse's link: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html

Reduces a‘s dimension by one, by applying ufunc along one axis.

Let a.shape = (N_0, ..., N_i, ..., N_{M-1}). Then ufunc.reduce(a, axis=i)[k_0, ..,k_{i-1}, k_{i+1}, .., k_{M-1}] = the result of iterating j over range(N_i), cumulatively applying ufunc to each a[k_0, ..,k_{i-1}, j, k_{i+1}, .., k_{M-1}]

So in pseudocode, when calling F.sum(axis=0):

for j in rows:      # axis=0, the axis being reduced
    for i in cols:  # axis=1
        sum[i] = sum[i] + F[j, i]

So it actually iterates over entire rows while calculating the column sum, which slows it down considerably when the data is in column-major order. Behaviour like this would explain the difference.
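The pseudocode above can be translated literally into Python to check that it really does produce the column sum. This is only an illustrative sketch of the iteration order, not NumPy's actual compiled implementation:

```python
import numpy as np

def column_sum_by_rows(a):
    """Reduce over axis=0 by streaming whole rows, per the pseudocode."""
    nrows, ncols = a.shape
    out = np.zeros(ncols, dtype=a.dtype)
    for j in range(nrows):       # the reduced axis: iterate over rows
        for i in range(ncols):   # the remaining axis: each column's total
            out[i] += a[j, i]
    return out

F = np.asfortranarray(np.arange(12.0).reshape(3, 4))
assert np.allclose(column_sum_by_rows(F), F.sum(axis=0))
```

Because the inner loop walks along a row, each step jumps by a full column's worth of bytes when the array is Fortran-ordered, which is exactly the unfavourable access pattern being described.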

eric's link points to the implementation, for anyone curious enough to dig through a large amount of code for the reason.

Answered by Tapio, Oct 12 '22


That's expected. If you check the result of

%timeit f.sum(axis=1)

it gives a timing similar to that of c. Similarly,

%timeit c.sum(axis=1)

is slower.


Some explanation: suppose you have the following structure

|1| |6|
|2| |7|
|3| |8|
|4| |9|
|5| |10|

As Eric mentioned, these operations work via reduce. Say we ask for the column sum. The intuitive mechanism, where each column is accessed once, summed, and recorded, is not what happens. In fact it is the opposite: each row is accessed in turn and the function (here, summing) is applied, essentially as if we had two arrays a, b and executed

a += b

That's a very informal restatement of what the reduce documentation describes rather cryptically. It requires the rows to be accessed contiguously even though we are performing a column sum: [1,6] + [2,7] + [3,8] + ... Hence, the direction of iteration depends on the implementation of the operation, not on the array's layout.
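This row accumulation can be sketched with the 5x2 structure above. Each step is a vectorized `acc += row`, so whole rows are streamed even though the result is a column-wise reduction, which is why C order (contiguous rows) wins:

```python
import numpy as np

# The 5x2 array from the diagram, stored in Fortran order.
f = np.asfortranarray(np.arange(1.0, 11.0).reshape(5, 2, order='F'))

acc = np.zeros(f.shape[1])
for row in f:      # rows are read one after another...
    acc += row     # ...and added elementwise, like `a += b`

# Same result as the built-in column sum: [1+2+3+4+5, 6+7+8+9+10]
assert np.allclose(acc, f.sum(axis=0))
print(acc)  # [15. 40.]
```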

Answered by percusse, Oct 12 '22