This is a follow-up to this answer to my previous question, Fastest approach to read thousands of images into one big numpy array.
In chapter 2.3, "Memory allocation of the ndarray", Travis Oliphant writes the following regarding how indexes are accessed in memory for C-ordered numpy arrays:
...to move through computer memory sequentially, the last index is incremented first, followed by the second-to-last index and so forth.
This can be confirmed by benchmarking the time it takes to store 2-D arrays along either the two first or the two last indexes of a 3-D array (for my purposes, this simulates loading 500 images of size 512x512 pixels):
import numpy as np

N = 512
n = 500
a = np.random.randint(0,255,(N,N))

def last_and_second_last():
    '''Store along the two last indexes'''
    imgs = np.empty((n,N,N), dtype='uint16')
    for num in range(n):
        imgs[num,:,:] = a
    return imgs

def second_and_third_last():
    '''Store along the two first indexes'''
    imgs = np.empty((N,N,n), dtype='uint16')
    for num in range(n):
        imgs[:,:,num] = a
    return imgs
Benchmark
In [2]: %timeit last_and_second_last()
136 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: %timeit second_and_third_last()
1.56 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So far so good. However, when I store the arrays along the last and third-last dimensions, this is almost as fast as storing them along the two last dimensions.
def last_and_third_last():
    '''Store along the last and first indexes'''
    imgs = np.empty((N,n,N), dtype='uint16')
    for num in range(n):
        imgs[:,num,:] = a
    return imgs
Benchmark
In [4]: %timeit last_and_third_last()
149 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is last_and_third_last() so much closer in speed to last_and_second_last() than to second_and_third_last()?

I'll try to illustrate the indexing, without getting into the details of processor caching etc.
Let's make a small 3d array with distinctive element values:
In [473]: X = np.mgrid[100:300:100,10:30:10,1:4:1].sum(axis=0)
In [474]: X
Out[474]:
array([[[111, 112, 113],
        [121, 122, 123]],

       [[211, 212, 213],
        [221, 222, 223]]])
In [475]: X.shape
Out[475]: (2, 2, 3)
ravel views it as a 1d array, and shows us how the values are laid out in memory. (This, incidentally, is with the default C ordering.)
In [476]: X.ravel()
Out[476]: array([111, 112, 113, 121, 122, 123, 211, 212, 213, 221, 222, 223])
When I index on the 1st dimension I get 2*3 values, a contiguous block of the above list:
In [477]: X[0,:,:].ravel()
Out[477]: array([111, 112, 113, 121, 122, 123])
Indexing instead on the last gives 4 values, selected from across the array - I've added .. to highlight that:
In [478]: X[:,:,0].ravel()
Out[478]: array([111,.. 121,.. 211,.. 221])
Indexing on the middle gives me 2 contiguous subblocks, i.e. 2 rows of X.
In [479]: X[:,0,:].ravel()
Out[479]: array([111, 112, 113,.. 211, 212, 213])
With the strides and shape calculations numpy can access any one element in X in (about) the same time. And in the X[:,:,i] case that's what it has to do. The 4 values are 'scattered' across the databuffer. But if it can access contiguous blocks, such as in X[i,:,:], it can delegate more of the action to low-level compiled and processor code. With X[:,i,:] those blocks aren't quite as big, but may still be big enough to make a big difference.
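To make the strides idea concrete, here is a small sketch (assuming np.mgrid produces the default 8-byte integers of a 64-bit build, so the numbers in the comments may differ on other platforms) that recomputes the flat position of one element of X from its strides:

import numpy as np

X = np.mgrid[100:300:100, 10:30:10, 1:4:1].sum(axis=0)

# shape (2, 2, 3) with 8-byte elements gives strides (2*3*8, 3*8, 8) = (48, 24, 8)
print(X.shape, X.strides, X.itemsize)

# the byte offset of X[i, j, k] is just the dot product of the index with the
# strides, which is why reaching any single element costs about the same
i, j, k = 1, 0, 2
offset = i*X.strides[0] + j*X.strides[1] + k*X.strides[2]
print(X.ravel()[offset // X.itemsize], X[i, j, k])   # 213 213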
In your test case, [n,:,:] iterates 500 times over 512*512 element blocks. [:,n,:] has to divide that access into 512 blocks of 512 each. [:,:,n] has to do 500 x 512 x 512 individual accesses.
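A quick sketch that derives those block counts from the strides of the destination slice in each case (the dictionary and variable names here are just for illustration):

import numpy as np

N, n = 512, 500
cases = {
    'imgs[num,:,:]': np.empty((n, N, N), dtype='uint16')[0, :, :],
    'imgs[:,num,:]': np.empty((N, n, N), dtype='uint16')[:, 0, :],
    'imgs[:,:,num]': np.empty((N, N, n), dtype='uint16')[:, :, 0],
}
for name, dest in cases.items():
    if dest.flags['C_CONTIGUOUS']:
        run = dest.size                      # one big block per image copy
    elif dest.strides[-1] == dest.itemsize:
        run = dest.shape[-1]                 # one contiguous block per row
    else:
        run = 1                              # element-by-element access
    print(name, '->', dest.size // run, 'block(s) of', run, 'element(s)')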
I wonder if working with uint16 exaggerates the effect. In another question we just showed that calculation with float16 is much slower (up to 10x) because the processor (and compiler) is tuned to work with 32- and 64-bit numbers. If the processor is tuned to moving blocks of 64-bit numbers around, then moving an isolated 16-bit number could require a lot of extra processing. It would be like doing a copy-and-paste from a document word by word, when copying line by line requires fewer keystrokes per copy.

The exact details are buried in the processor, operating system and compiler, as well as the numpy code, but hopefully this gives a feel for why your middle case falls much closer to the optimum than to the worst case.
On testing - setting imgs to a.dtype slows things down a bit for all cases, so the uint16 isn't causing any special problems.

Related question: Why does `numpy.einsum` work faster with `float32` than `float16` or `uint16`?
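If you want to check on your own machine whether the 16-bit dtype matters, a sketch along these lines (the fill helper is just for illustration) re-runs the fastest and slowest layouts for several dtypes:

import numpy as np
from timeit import timeit

N, n = 512, 500
a = np.random.randint(0, 255, (N, N))

def fill(dtype, contiguous=True):
    '''Copy `a` into a 3-D stack n times, along the two last or the two first axes.'''
    if contiguous:
        imgs = np.empty((n, N, N), dtype=dtype)
        for num in range(n):
            imgs[num, :, :] = a
    else:
        imgs = np.empty((N, N, n), dtype=dtype)
        for num in range(n):
            imgs[:, :, num] = a
    return imgs

for dtype in ('uint16', 'uint32', 'uint64', 'float32'):
    fast = timeit(lambda: fill(dtype, True), number=3)
    slow = timeit(lambda: fill(dtype, False), number=3)
    print('%8s  contiguous %.3fs   strided %.3fs' % (dtype, fast, slow))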
Numpy's arrays are built on C and C++, so we get to think about things like cache lines when we're pushing them to their absolute limits. In last_and_second_last() and last_and_third_last() you are reading more than a single byte along the last axis, so an entire cache line is used at a time (16 of them, actually, as your last axis is 1024 bytes long). In second_and_third_last(), an entire cache line must be copied to read (or write) a single value in the last axis. Modern C compilers (and others: Fortran, etc.) will take nested loops that access memory in the wrong order like this and silently re-order them to optimize cache usage, but Python cannot do this.
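For the shapes in the question, that count of 16 is just the row size divided by the cache-line size (a sketch assuming a 64-byte line, which is typical but not universal):

N = 512
itemsize = 2                    # uint16
cache_line = 64                 # bytes; typical on current x86-64 CPUs
row_bytes = N * itemsize        # 1024 bytes per 512-element row
print(row_bytes // cache_line)  # 16 cache lines touched per contiguous row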
Example:

arr = np.arange(64).reshape([4,4,4])

- arr[i,j,:]: you can load all of them into the cache at once (e.g. [0,1,2,3])
- arr[i,:,k]: you must load arr[i,0,:] and read [k] from that array of 4, then load arr[i,1,:] and read [k] from that array of 4, and so on

The key point here is not that it is the last axis, but that it is the axis ax where imgs.strides[ax] == imgs.dtype.itemsize - that is, the memory is contiguous along that axis. The default behaviour is to apply this to the last axis, but don't assume that - you'll see the opposite behaviour with imgs.T (as this creates a view, reversing the strides array).
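You can see that directly from the strides; here is a small sketch using the shape from last_and_second_last() above:

import numpy as np

imgs = np.empty((500, 512, 512), dtype='uint16')

print(imgs.strides)    # (524288, 1024, 2): the last stride equals itemsize (2)
print(imgs.T.strides)  # (2, 1024, 524288): the transpose is a view with reversed
                       # strides, so the first axis is now the contiguous one

# the 'fast' axis is wherever strides[ax] == itemsize
print([ax for ax in range(3) if imgs.strides[ax] == imgs.itemsize])    # [2]
print([ax for ax in range(3) if imgs.T.strides[ax] == imgs.itemsize])  # [0]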
When NumPy detects that axes are contiguous, it can use a memcpy on the entire dimension, which compilers optimize heavily. In other cases, NumPy can only memcpy one element at a time.
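As a rough illustration, copying the same data into a contiguous destination versus a view with reversed strides shows the difference (a sketch; the exact ratio will depend on your machine):

import numpy as np
from timeit import timeit

a = np.random.randint(0, 255, (512, 512)).astype('uint16')

dst_c = np.empty((512, 512), dtype='uint16')     # C-contiguous destination
dst_t = np.empty((512, 512), dtype='uint16').T   # transposed view, reversed strides

print(dst_c.flags['C_CONTIGUOUS'], dst_t.flags['C_CONTIGUOUS'])  # True False

print(timeit(lambda: np.copyto(dst_c, a), number=1000))  # whole-block copy
print(timeit(lambda: np.copyto(dst_t, a), number=1000))  # strided copy, usually slower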