Memory (row-major order):
[[A(0,0), A(0,1)]
 [A(1,0), A(1,1)]]
has this memory layout:
[A(0,0), A(0,1), A(1,0), A(1,1)]
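A quick way to see this from Python is to look at the array's strides, the number of bytes to step along each axis (a minimal sketch; the byte counts assume the default 8-byte integer dtype of a 64-bit Linux build):

import numpy as np

A = np.arange(4).reshape(2, 2)  # [[0, 1], [2, 3]], row major by default
print(A.ravel())    # [0 1 2 3] -- the flat memory order shown above
print(A.strides)    # (16, 8): 16 bytes to the next row, 8 to the next column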
I guess the algorithm works like this in the following cases.
Broadcasting dimension is the last dimension:
[[0, 1, 2, 3]        [[ 1]
 [4, 5, 6, 7]]   x    [10]]
A (2 by 4)           B (2 by 1)
Iterate 0th dimensions of A and B simultaneously {
    Iterate last dimension of A {
        multiply;
    }
}
Broadcasting dimension is the 0th dimension:
[[0, 1, 2, 3]
 [4, 5, 6, 7]]   x   [[1, 10, 100, 1000]]
A (2 by 4)           B (1 by 4)
Iterate 0th dimension of A {
    Iterate 1st dimensions of A and B simultaneously {
        multiply;
    }
}
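In plain Python the two guessed loop structures would look like this (just to illustrate the iteration order, not what numpy actually runs):

# Case 1: B is (2, 1) -- reuse one B value across each row of A
A = [[0, 1, 2, 3], [4, 5, 6, 7]]
B = [[1], [10]]
out1 = [[a * b[0] for a in row] for row, b in zip(A, B)]
print(out1)  # [[0, 1, 2, 3], [40, 50, 60, 70]]

# Case 2: B is (1, 4) -- reuse B's single row for every row of A
B2 = [[1, 10, 100, 1000]]
out2 = [[a * b for a, b in zip(row, B2[0])] for row in A]
print(out2)  # [[0, 10, 200, 3000], [4, 50, 600, 7000]]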
Questions:
How does numpy know which order of multiplication is best? (Reading memory in order is better than reading memory all over the place, but how does numpy figure that out?)
What would numpy do if the arrays have more than two dimensions?
2nd guess of what is going on:
#include <iostream>

int main(void){
    const int nA = 12;
    const int nB = 3;
    int A[nA];
    int B[nB];
    for(int i = 0; i != nA; ++i) A[i] = i+1;
    for(int i = 0; i != nB; ++i) B[i] = i+1;
    // dimensions: A is (2,3,2), B is (1,3,1)
    int dA[] = {2,3,2};
    int dB[] = {1,3,1};
    (void)dA; (void)dB;  // documentation only
    int* pA = A;
    int* pB = B;
    int* pA_end = A + nA;
    // is it possible to make the compiler generate the iB and sB?
    int iB = 0;
    int iB_max = 2;     // length of A's innermost axis, where B has size 1
    int sB[] = {0, 1};  // B stays put within the innermost axis, then steps
    while(pA != pA_end){
        std::cout << "*pA, *pB: " << *pA << ", " << *pB << std::endl;
        std::cout << "iB: " << iB << std::endl;
        *pA *= *pB;
        ++pA;
        pB += sB[iB];
        ++iB;
        if (iB == iB_max) iB = 0;
        if (pB == B + nB) pB = B;  // wrap B: its 0th axis has size 1
    }
    for(pA = A; pA != pA_end; ++pA){
        std::cout << *pA << ", ";
    }
    std::cout << std::endl;
}
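For reference, numpy's result for the same values and shapes, which the program above should reproduce:

import numpy as np

A = np.arange(1, 13).reshape(2, 3, 2)  # same values as the C++ array A
B = np.arange(1, 4).reshape(1, 3, 1)   # same values as the C++ array B
print((A * B).ravel())
# [ 1  2  6  8 15 18  7  8 18 20 33 36]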
To really get into the details of broadcasting you need to understand array shape and strides. But a lot of the work is now implemented in C code using nditer. You can read about it at http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html. np.nditer gives you access to the tool at the Python level, but its real value comes when used with Cython or your own C code.
np.lib.stride_tricks has functions that let you play with strides. One of its functions helps visualize how arrays are broadcast together. In practice the work is done with nditer, but this function may help you understand the action:
In [629]: np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3),
     ...:                                       np.array([[1],[2]]))
Out[629]:
[array([[0, 1, 2],
        [3, 4, 5]]),
 array([[1, 1, 1],
        [2, 2, 2]])]
Note that, effectively, the 2nd array has been replicated to match the 1st's shape. But the replication is done with stride tricks, not with actual copies.
In [631]: A,B=np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3),
     ...:                                           np.array([[1],[2]]))
In [632]: A.shape
Out[632]: (2, 3)
In [633]: A.strides
Out[633]: (12, 4)
In [634]: B.shape
Out[634]: (2, 3)
In [635]: B.strides
Out[635]: (4, 0)
It's this (4, 0) strides value that does the replication without copying.
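You can build the same zero-stride view by hand with as_strided (a sketch to show the mechanism; as_strided is generally unsafe because the replicated elements all share memory):

import numpy as np
from numpy.lib.stride_tricks import as_strided

B = np.array([[1], [2]])
# repeat B's single column 3 times without copying: stride 0 along axis 1
B_view = as_strided(B, shape=(2, 3), strides=(B.strides[0], 0))
print(B_view)                       # [[1 1 1]
                                    #  [2 2 2]]
print(np.shares_memory(B, B_view))  # True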
=================
Using the Python-level nditer, here's what it does during broadcasting.
In [1]: A=np.arange(6).reshape(2,3)
In [2]: B=np.array([[1],[2]])
The plain nditer feeds elements one set at a time (http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html#using-an-external-loop):
In [5]: it =np.nditer((A,B))
In [6]: for a,b in it:
   ...:     print(a,b)
0 1
1 1
2 1
3 2
4 2
5 2
But when I turn external_loop on, it iterates in chunks, here the respective rows of the broadcast arrays:
In [7]: it =np.nditer((A,B), flags=['external_loop'])
In [8]: for a,b in it:
   ...:     print(a,b)
[0 1 2] [1 1 1]
[3 4 5] [2 2 2]
With a more complex broadcasting the external_loop still produces 1d arrays that allow simple C-style iteration:
In [13]: A1=np.arange(24).reshape(3,2,4)
In [18]: it =np.nditer((A1,np.arange(3)[:,None,None]), flags=['external_loop'])
In [19]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([0, 0, 0, 0, 0, 0, 0, 0]))
(array([ 8, 9, 10, 11, 12, 13, 14, 15]), array([1, 1, 1, 1, 1, 1, 1, 1]))
(array([16, 17, 18, 19, 20, 21, 22, 23]), array([2, 2, 2, 2, 2, 2, 2, 2]))
Note that while A1 is (3,2,4), the nditer loop yields 3 steps (along the 1st axis), each with 2*4 = 8 elements.
I found in another cython/nditer SO question that the first approach did not produce much of a speed improvement, but the 2nd helped a lot. In C or Cython the external_loop case would do simple low-level iteration.
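At the Python level that pattern looks roughly like this (a sketch following the nditer docs' allocated-output example; numpy's real inner loops are compiled C):

import numpy as np

A1 = np.arange(24).reshape(3, 2, 4)
B1 = np.arange(3)[:, None, None]

# a None operand asks nditer to allocate the output array
it = np.nditer((A1, B1, None), flags=['external_loop'])
for x, y, z in it:
    z[...] = x * y  # each step is a flat 1d chunk: a simple low-level loop
result = it.operands[2]
print(np.array_equal(result, A1 * B1))  # True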
===============
If I broadcast on the 1st and 3rd axes, the iterator takes 2*3 steps (effectively flattening the 1st 2 axes, and feeding the 3rd):
In [20]: it =np.nditer((A1,np.arange(2)[None,:,None]), flags=['external_loop'])
In [21]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:
(array([0, 1, 2, 3]), array([0, 0, 0, 0]))
(array([4, 5, 6, 7]), array([1, 1, 1, 1]))
(array([ 8, 9, 10, 11]), array([0, 0, 0, 0]))
(array([12, 13, 14, 15]), array([1, 1, 1, 1]))
(array([16, 17, 18, 19]), array([0, 0, 0, 0]))
(array([20, 21, 22, 23]), array([1, 1, 1, 1]))
But with buffered, it iterates once, feeding me 2 1d arrays:
In [22]: it =np.nditer((A1,np.arange(2)[None,:,None]), flags=['external_loop','buffered'])
In [23]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]),
 array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]))
The SO question "Does Cython offer any reasonably easy and efficient way to iterate Numpy arrays as if they were flat?" has some speed tests, showing that the buffered external loop is fastest.
Cython translates this into fast, simple C iteration:
for xarr in it:
    x = xarr
    size = x.shape[0]
    for i in range(size):
        x[i] = x[i] + 1.0