NumPy ufuncs are 2x faster along one axis than the other

I was doing some computation and measured the performance of ufuncs like np.cumsum along different axes, in order to make the code more performant.

In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

cumsum over axis 1 is almost 2x faster than cumsum over axis 0. Why is that, and what is going on behind the scenes? It would be nice to have a clear understanding of the reason. Thanks!


Update: After a bit of research, I realized that if someone is building an application that always sums over only one particular axis, then the array should be initialized in the appropriate layout: C-order for axis=1 sums or Fortran-order for axis=0 sums, to save CPU time.

Also: this excellent answer on the difference between contiguous and non-contiguous arrays helped a lot!
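For illustration, here is a minimal timing sketch of that idea (the 1000x1000 shape just mirrors the example above; np.asfortranarray makes a column-major copy of the same data):

import numpy as np
from timeit import timeit

# Same values in two memory layouts: C-order (row-major) and Fortran-order (column-major).
arr_c = np.arange(int(1E6)).reshape(int(1E3), -1)  # C-contiguous, as in the question
arr_f = np.asfortranarray(arr_c)                   # Fortran-contiguous copy of the same data

# With C-order, axis=1 walks contiguous memory; with Fortran-order, axis=0 does.
print(timeit(lambda: arr_c.cumsum(axis=1), number=100))  # fast: contiguous access
print(timeit(lambda: arr_c.cumsum(axis=0), number=100))  # slower: strided access
print(timeit(lambda: arr_f.cumsum(axis=0), number=100))  # fast again once the layout matches the axis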

Asked Jan 27 '18 by kmario23

2 Answers

You have a square array. It looks like this:

1 2 3
4 5 6
7 8 9

But computer memory is linearly addressed, so to the computer it looks like this:

1 2 3 4 5 6 7 8 9

Or, depending on how the array is stored, it might look like this:

1 4 7 2 5 8 3 6 9

If you are trying to sum [1 2 3] or [4 5 6] (one row), the first layout is faster. If you are trying to sum [1 4 7] or [2 5 8], the second layout is faster.

This happens because data is loaded from memory one "cache line" at a time, typically 64 bytes (8 values at NumPy's default 8-byte float dtype).

You can control which layout NumPy uses when you construct an array, using the order parameter.
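As a small sketch of the two layouts (the 3x3 values just mirror the example above), you can inspect an array's memory order and strides:

import numpy as np

a_c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order='C')  # row-major
a_f = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order='F')  # column-major

print(a_c.ravel(order='K'))  # memory order: [1 2 3 4 5 6 7 8 9]
print(a_f.ravel(order='K'))  # memory order: [1 4 7 2 5 8 3 6 9]

print(a_c.strides)  # e.g. (24, 8): stepping along a row moves 8 bytes at a time
print(a_f.strides)  # e.g. (8, 24): stepping down a column is the contiguous direction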

For more on this, see: https://en.wikipedia.org/wiki/Row-_and_column-major_order

Answered by John Zwinck

The arrays are row-major (C-ordered) by default. Therefore, when you're summing over axis 1, the numbers lie in contiguous blocks of memory. That allows for better cache performance and therefore faster memory access (cf. "locality of reference"). I assume that's the effect you're seeing here.
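As a quick sketch of that (assuming the default 8-byte integer dtype from np.arange), compare the flags and strides of a row view versus a column view:

import numpy as np

arr = np.arange(int(1E6)).reshape(int(1E3), -1)  # row-major (C-order) by default

row = arr[0]     # one row: its elements sit next to each other in memory
col = arr[:, 0]  # one column: consecutive elements are a full row apart

print(row.flags['C_CONTIGUOUS'])  # True
print(col.flags['C_CONTIGUOUS'])  # False
print(row.strides, col.strides)   # e.g. (8,) vs (8000,): 1000 * 8 bytes between column elements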

Answered by rain city