This is a follow-up to this answer to my previous question, Fastest approach to read thousands of images into one big numpy array.
In chapter 2.3, "Memory allocation of the ndarray", Travis Oliphant writes the following regarding how indexes are accessed in memory for C-ordered numpy arrays:
...to move through computer memory sequentially, the last index is incremented first, followed by the second-to-last index and so forth.
This can be confirmed by benchmarking the time it takes to store 2-D arrays along either the two first or the two last indexes of a 3-D array (for my purposes, this simulates loading 500 images of size 512x512 pixels):
import numpy as np

N = 512
n = 500
a = np.random.randint(0,255,(N,N))

def last_and_second_last():
    '''Store along the two last indexes'''
    imgs = np.empty((n,N,N), dtype='uint16')
    for num in range(n):
        imgs[num,:,:] = a
    return imgs

def second_and_third_last():
    '''Store along the two first indexes'''
    imgs = np.empty((N,N,n), dtype='uint16')
    for num in range(n):
        imgs[:,:,num] = a
    return imgs
Benchmark
In [2]: %timeit last_and_second_last()
136 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: %timeit second_and_third_last()
1.56 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So far so good. However, when I store the arrays along the last and third-last dimensions, this is almost as fast as storing them along the two last dimensions.
def last_and_third_last():
    '''Store along the last and first indexes'''
    imgs = np.empty((N,n,N), dtype='uint16')
    for num in range(n):
        imgs[:,num,:] = a
    return imgs
Benchmark
In [4]: %timeit last_and_third_last()
149 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is last_and_third_last() so much closer in speed to last_and_second_last() than to second_and_third_last()?

I'll try to illustrate the indexing, without getting into the details of processor caching etc.
Let's make a small 3d array with distinctive element values:
In [473]: X = np.mgrid[100:300:100,10:30:10,1:4:1].sum(axis=0)
In [474]: X
Out[474]:
array([[[111, 112, 113],
        [121, 122, 123]],

       [[211, 212, 213],
        [221, 222, 223]]])
In [475]: X.shape
Out[475]: (2, 2, 3)
ravel views it as a 1d array, and shows us how the values are laid out in memory. (This, incidentally, is with the default C ordering.)
In [476]: X.ravel()
Out[476]: array([111, 112, 113, 121, 122, 123, 211, 212, 213, 221, 222, 223])
When I index on the 1st dimension I get 2*3 values, a contiguous block of the above list:
In [477]: X[0,:,:].ravel()
Out[477]: array([111, 112, 113, 121, 122, 123])
Indexing instead on the last gives 4 values, selected from across the array - I've added .. to highlight that:
In [478]: X[:,:,0].ravel()
Out[478]: array([111,.. 121,.. 211,.. 221])
Indexing on the middle gives me 2 contiguous subblocks, i.e. 2 rows of X.
In [479]: X[:,0,:].ravel()
Out[479]: array([111, 112, 113,.. 211, 212, 213])
With the strides and shape calculations numpy can access any one element in X in (about) the same time. And in the X[:,:,i] case that's what it has to do. The 4 values are 'scattered' across the databuffer. But if it can access contiguous blocks, such as in X[i,:,:], it can delegate more of the action to low-level compiled and processor code. With X[:,i,:] those blocks aren't quite as big, but may still be big enough to make a big difference.
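To make the strides idea concrete, here is a small sketch (assuming np.mgrid produces the default 8-byte integers of a 64-bit build, so the numbers in the comments may differ on other platforms) that recomputes the flat position of one element of X from its strides:

import numpy as np

X = np.mgrid[100:300:100, 10:30:10, 1:4:1].sum(axis=0)

# shape (2, 2, 3) with 8-byte elements gives strides (2*3*8, 3*8, 8) = (48, 24, 8)
print(X.shape, X.strides, X.itemsize)

# the byte offset of X[i, j, k] is just the dot product of the index with the
# strides, which is why reaching any single element costs about the same
i, j, k = 1, 0, 2
offset = i*X.strides[0] + j*X.strides[1] + k*X.strides[2]
print(X.ravel()[offset // X.itemsize], X[i, j, k])   # 213 213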
In your test case, [n,:,:] iterates 500 times over 512*512 element blocks. [:,n,:] has to divide that access into 512 blocks of 512 each. [:,:,n] has to do 500 x 512 x 512 individual accesses.
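A quick sketch that derives those block counts from the strides of the destination slice in each case (the dictionary and variable names here are just for illustration):

import numpy as np

N, n = 512, 500
cases = {
    'imgs[num,:,:]': np.empty((n, N, N), dtype='uint16')[0, :, :],
    'imgs[:,num,:]': np.empty((N, n, N), dtype='uint16')[:, 0, :],
    'imgs[:,:,num]': np.empty((N, N, n), dtype='uint16')[:, :, 0],
}
for name, dest in cases.items():
    if dest.flags['C_CONTIGUOUS']:
        run = dest.size                      # one big block per image copy
    elif dest.strides[-1] == dest.itemsize:
        run = dest.shape[-1]                 # one contiguous block per row
    else:
        run = 1                              # element-by-element access
    print(name, '->', dest.size // run, 'block(s) of', run, 'element(s)')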
I wonder if working with uint16 exaggerates the effect. In another question we just showed that calculation with float16 is much slower (up to 10x) because the processor (and compiler) is tuned to work with 32- and 64-bit numbers. If the processor is tuned to moving blocks of 64-bit numbers around, then moving an isolated 16-bit number could require a lot of extra processing. It would be like doing a copy-and-paste from a document word by word, when copying line by line requires fewer keystrokes per copy.

The exact details are buried in the processor, operating system and compiler, as well as the numpy code, but hopefully this gives a feel for why your middle case falls much closer to the optimum than to the worst case.
On testing - setting imgs to a.dtype slows things down a bit for all cases, so the uint16 isn't causing any special problems.

Related question: Why does `numpy.einsum` work faster with `float32` than `float16` or `uint16`?
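If you want to check on your own machine whether the 16-bit dtype matters, a sketch along these lines (the fill helper is just for illustration) re-runs the fastest and slowest layouts for several dtypes:

import numpy as np
from timeit import timeit

N, n = 512, 500
a = np.random.randint(0, 255, (N, N))

def fill(dtype, contiguous=True):
    '''Copy `a` into a 3-D stack n times, along the two last or the two first axes.'''
    if contiguous:
        imgs = np.empty((n, N, N), dtype=dtype)
        for num in range(n):
            imgs[num, :, :] = a
    else:
        imgs = np.empty((N, N, n), dtype=dtype)
        for num in range(n):
            imgs[:, :, num] = a
    return imgs

for dtype in ('uint16', 'uint32', 'uint64', 'float32'):
    fast = timeit(lambda: fill(dtype, True), number=3)
    slow = timeit(lambda: fill(dtype, False), number=3)
    print('%8s  contiguous %.3fs   strided %.3fs' % (dtype, fast, slow))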
Numpy's arrays are built on C and C++, so we get to think about things like cache lines when we're pushing them to their absolute limits. In last_and_second_last() and last_and_third_last() you are reading more than a single byte along the last axis, so an entire cache line is used at a time (16 of them, actually, as your last axis is 1024 bytes long). In second_and_third_last(), an entire cache line must be copied to read (or write) a single value in the last axis. Modern C compilers (and others: Fortran, etc.) will take nested loops that access memory in the wrong order like this and silently re-order them to optimize cache usage, but Python cannot do this.
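For the shapes in the question, that count of 16 is just the row size divided by the cache-line size (a sketch assuming a 64-byte line, which is typical but not universal):

N = 512
itemsize = 2                    # uint16
cache_line = 64                 # bytes; typical on current x86-64 CPUs
row_bytes = N * itemsize        # 1024 bytes per 512-element row
print(row_bytes // cache_line)  # 16 cache lines touched per contiguous row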
Example:

arr = np.arange(64).reshape([4,4,4])

- arr[i,j,:]: you can load all of them into the cache at once (e.g. [0,1,2,3])
- arr[i,:,k]: you must load arr[i,0,:] and read [k] from that array of 4, then load arr[i,1,:] and read [k] from that array of 4, and so on

The key point here is not that it is the last axis, but that it is the axis ax where imgs.strides[ax] == imgs.dtype.itemsize - that is, the memory is contiguous along that axis. The default behaviour is to apply this to the last axis, but don't assume that - you'll see the opposite behaviour with imgs.T (as this creates a view, reversing the strides array).
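You can see that directly from the strides; here is a small sketch using the shape from last_and_second_last() above:

import numpy as np

imgs = np.empty((500, 512, 512), dtype='uint16')

print(imgs.strides)    # (524288, 1024, 2): the last stride equals itemsize (2)
print(imgs.T.strides)  # (2, 1024, 524288): the transpose is a view with reversed
                       # strides, so the first axis is now the contiguous one

# the 'fast' axis is wherever strides[ax] == itemsize
print([ax for ax in range(3) if imgs.strides[ax] == imgs.itemsize])    # [2]
print([ax for ax in range(3) if imgs.T.strides[ax] == imgs.itemsize])  # [0]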
When NumPy detects that axes are contiguous, it can use a memcpy on the entire dimension, which compilers optimize heavily. In other cases, NumPy can only memcpy one element at a time.
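As a rough illustration, copying the same data into a contiguous destination versus a view with reversed strides shows the difference (a sketch; the exact ratio will depend on your machine):

import numpy as np
from timeit import timeit

a = np.random.randint(0, 255, (512, 512)).astype('uint16')

dst_c = np.empty((512, 512), dtype='uint16')     # C-contiguous destination
dst_t = np.empty((512, 512), dtype='uint16').T   # transposed view, reversed strides

print(dst_c.flags['C_CONTIGUOUS'], dst_t.flags['C_CONTIGUOUS'])  # True False

print(timeit(lambda: np.copyto(dst_c, a), number=1000))  # whole-block copy
print(timeit(lambda: np.copyto(dst_t, a), number=1000))  # strided copy, usually slower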