Below, I compared the performance of sum operations on C-contiguous and Fortran-contiguous arrays (C vs. Fortran memory order). I set axis=0
to ensure numbers are added up column-wise. I was surprised that the Fortran-contiguous array is actually slower than its C counterpart. Isn't a Fortran-contiguous array supposed to store its columns contiguously in memory, and hence be better at column-wise operations?
import numpy as np
a = np.random.standard_normal((10000, 10000))
c = np.array(a, order='C')
f = np.array(a, order='F')
In a Jupyter notebook, run
%timeit c.sum(axis=0)
10 loops, best of 3: 84.6 ms per loop
%timeit f.sum(axis=0)
10 loops, best of 3: 137 ms per loop
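For reference, the flags attribute confirms the two layouts; a quick sanity check (using a smaller shape than above, just for convenience):

```python
import numpy as np

a = np.random.standard_normal((1000, 1000))  # smaller array, layout check only
c = np.array(a, order='C')
f = np.array(a, order='F')

# A 2-D copy in each order sets exactly one contiguity flag
print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])   # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])   # False True
```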
I think it's in the implementation of the np.sum(). For example:
import numpy as np
A = np.random.standard_normal((10000,10000))
C = np.array(A, order='C')
F = np.array(A, order='F')
Benchmarking with IPython:
In [7]: %timeit C.sum(axis=0)
10 loops, best of 3: 101 ms per loop
In [8]: %timeit C.sum(axis=1)
10 loops, best of 3: 149 ms per loop
In [9]: %timeit F.sum(axis=0)
10 loops, best of 3: 149 ms per loop
In [10]: %timeit F.sum(axis=1)
10 loops, best of 3: 102 ms per loop
So it's behaving exactly the opposite as expected. But let's try out some other function:
In [17]: %timeit np.amax(C, axis=0)
1 loop, best of 3: 173 ms per loop
In [18]: %timeit np.amax(C, axis=1)
10 loops, best of 3: 70.4 ms per loop
In [13]: %timeit np.amax(F,axis=0)
10 loops, best of 3: 72 ms per loop
In [14]: %timeit np.amax(F,axis=1)
10 loops, best of 3: 168 ms per loop
Sure, it's apples to oranges. But np.amax() works along an axis just as sum does, returning a vector with one element per row/column, and it behaves as one would expect.
In [25]: C.strides
Out[25]: (80000, 8)
In [26]: F.strides
Out[26]: (8, 80000)
This tells us that the arrays are in fact packed in row order and column order respectively, and looping in that direction should be a lot faster. Unless, for example, sum builds the column sums (axis=0) by accumulating the array row by row. But without a means of peeking inside the .pyd I'm just speculating.
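The stride arithmetic can be checked directly. A small sketch (shape chosen just for illustration) showing that the byte offset of element [i, j] is i*strides[0] + j*strides[1]:

```python
import numpy as np

A = np.arange(12, dtype=np.float64).reshape(3, 4)
C = np.array(A, order='C')
F = np.array(A, order='F')

# strides = bytes to step to the next element along each axis
print(C.strides)  # (32, 8): a row is one contiguous 4*8-byte block
print(F.strides)  # (8, 24): a column is one contiguous 3*8-byte block

# Byte offset of element [i, j] is i*strides[0] + j*strides[1];
# dividing by itemsize gives its position in the raw memory buffer.
i, j = 1, 2
flat_c = C.ravel(order='K')  # memory-order view of C
flat_f = F.ravel(order='K')  # memory-order view of F
assert flat_c[(i * C.strides[0] + j * C.strides[1]) // C.itemsize] == C[i, j]
assert flat_f[(i * F.strides[0] + j * F.strides[1]) // F.itemsize] == F[i, j]
```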
EDIT:
From percusse's link: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html
Reduces a's dimension by one, by applying ufunc along one axis.
Let a.shape = (N_0, ..., N_i, ..., N_{M-1}). Then ufunc.reduce(a, axis=i)[k_0, ..,k_{i-1}, k_{i+1}, .., k_{M-1}] = the result of iterating j over range(N_i), cumulatively applying ufunc to each a[k_0, ..,k_{i-1}, j, k_{i+1}, .., k_{M-1}]
So in pseudocode, when calling F.sum(axis=0):
for i = 0 .. rows-1      # iterate along axis=0, one whole row per step
    for j = 0 .. cols-1  # accumulate into the per-column running sums
        out(j) = out(j) + F(i, j)
So it would actually iterate over each row when calculating the column sum, which slows things down considerably when the data is in column-major order. Behaviour such as this would explain the difference.
eric's link provides us with the implementation, for anybody curious enough to dig through a large amount of code for the reason.
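This reading of reduce can be sanity-checked from Python; the following is a rough sketch of the semantics, not the actual C loop:

```python
import numpy as np

F = np.asfortranarray(np.random.standard_normal((1000, 500)))

# sum along an axis is a reduction of the add ufunc
assert np.allclose(F.sum(axis=0), np.add.reduce(F, axis=0))

# Mimic the quoted reduce semantics for axis=0: iterate over rows,
# cumulatively adding each whole row into the running result.
acc = F[0].copy()
for j in range(1, F.shape[0]):
    acc += F[j]   # reads one full row per step -- against F's memory order
assert np.allclose(acc, F.sum(axis=0))
```

Each loop step touches a full row, which is the slow direction for a Fortran-ordered array.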
That's expected. If you check the result of
%timeit f.sum(axis=1)
it gives a timing similar to that of c.sum(axis=0). Similarly,
%timeit c.sum(axis=1)
is slower.
Some explanation: suppose you have the following structure
|1| |6|
|2| |7|
|3| |8|
|4| |9|
|5| |10|
As Eric mentioned, these operations work via reduce. Let's say we are asking for the column sum. The intuitive mechanism, in which each column is accessed once, summed, and recorded, is not at play. In fact it is the opposite: each row is accessed in turn and the function (here, summing) is applied, in essence similar to having two arrays a, b
and executing
a += b
That's a very informal way of repeating what the documentation of reduce states rather cryptically. This requires rows to be accessed contiguously even though we are performing a column sum: [1,6] + [2,7] + [3,8] + ... Hence, the direction of traversal is determined by the implementation of the operation, not by the array's layout.
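Using the 5x2 layout above, that mechanism can be sketched as repeated row additions (a toy illustration, not NumPy's internals):

```python
import numpy as np

# The 5x2 example from above, stored in Fortran order
F = np.array([[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]], order='F')

# Column sum as reduce performs it: accumulate row by row
acc = F[0].copy()
for row in F[1:]:
    acc += row            # [1,6] + [2,7] + [3,8] + ...
print(acc)                # [15 40]
assert (acc == F.sum(axis=0)).all()
```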