Assume I have a numpy array of size (n, m) where n is very large but contains a lot of duplication: rows 0:n1 are identical, rows n1:n2 are identical, etc. (with n2 % n1 != 0, i.e. not at regular intervals). Is there a way to store only one copy of the values for each group of duplicate rows while still having a view of the entire array?
example:
import numpy as np

unique_values = np.array([[1,1,1], [2,2,2], [3,3,3]])  # these are the values I want to store in memory
index_mapping = np.array([0,0,1,1,1,2,2])  # maps each row of the full array to a row of unique_values
unique_values_view = np.array([[1,1,1],[1,1,1],[2,2,2],[2,2,2],[2,2,2],[3,3,3],[3,3,3]])  # this is how I want the view to look, for broadcasting reasons
I plan to multiply the array (the view) by some other array of size (1, m), and then take the dot product of the result:
other_array1 = np.arange(unique_values.shape[1]).reshape(1,-1)  # (1,m)
other_array2 = 2*np.ones((unique_values.shape[1],1))  # (m,1)
output = np.dot(unique_values_view * other_array1, other_array2).squeeze()
The output is a 1D array of length n.
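For what it's worth, the desired expanded array can be rebuilt from the two stored arrays with fancy indexing; note that this materializes a copy rather than a true view (numpy views need a regular stride pattern, which irregular group sizes rule out), which is why the answers below push the indexing to the very end instead:
# fancy indexing expands the stored rows, but allocates a full (n, m) copy
assert (unique_values[index_mapping] == unique_values_view).all()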
The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
Limitations of broadcasting: arithmetic, including broadcasting, can only be performed when, for each dimension, the arrays' sizes are equal or one of them is 1.
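As a minimal illustration of that rule, using the shapes from the question (with toy sizes n=7, m=3 assumed):
a = np.ones((7, 3))              # shape (n, m)
b = np.arange(3).reshape(1, -1)  # shape (1, m)
# the size-1 leading dimension of b is stretched across all 7 rows of a
print((a * b).shape)             # (7, 3)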
Your expression admits two significant optimizations:
- do the indexing with index_mapping last, so that all the arithmetic runs on the small unique_values array, and
- multiply other_array1 with other_array2 first and then with unique_values.
Let's apply these optimizations:
>>> output_pp = (unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]
# check for correctness
>>> (output == output_pp).all()
True
# and compare it to @Yakym Pirozhenko's approach
>>> from timeit import timeit
>>> print("yp:", timeit("np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]", globals=globals()))
yp: 3.9105667411349714
>>> print("pp:", timeit("(unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]", globals=globals()))
pp: 2.2684884609188884
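(A note on the setup, which isn't stated explicitly: with globals=globals(), timeit picks up whatever arrays are bound in the session, and by default it runs the statement 1,000,000 times, so the figures are total seconds over a million evaluations; presumably the small example arrays from the question were used.)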
These optimizations are easy to spot if we observe two things:
(1) if A is an m x n matrix and b is an n-vector, then
A * b == A @ diag(b)
A.T * b[:, None] == diag(b) @ A.T
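A quick numerical check of (1), with np.diag standing in for diag and arbitrary toy values:
A = np.arange(12.).reshape(3, 4)  # m x n with m=3, n=4
b = np.array([1., 2., 3., 4.])    # n-vector
assert np.allclose(A * b, A @ np.diag(b))
assert np.allclose(A.T * b[:, None], np.diag(b) @ A.T)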
(2) if A is an m x n matrix and I is a k-vector of integers from range(m), then
A[I] == onehot(I) @ A
onehot can be defined as
def onehot(I, m, dtype=int):
    # one row per entry of I; row i contains a single 1, at column I[i]
    out = np.zeros((I.size, m), dtype=dtype)
    out[np.arange(I.size), I] = 1
    return out
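And a quick check of (2) with toy values:
A = np.arange(12).reshape(4, 3)  # m x n with m=4
I = np.array([0, 0, 2, 3, 3])    # k-vector of indices from range(4)
assert (A[I] == onehot(I, 4) @ A).all()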
Using these facts, and abbreviating unique_values, index_mapping, other_array1 and other_array2 as uv, im, oa1 and oa2, we can write
uv[im] * oa1 @ oa2 == onehot(im) @ uv @ diag(oa1) @ oa2
The above optimizations are now simply a matter of choosing the best order for these matrix multiplications, which is
onehot(im) @ (uv @ (diag(oa1) @ oa2))
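Evaluated right to left, this order first collapses diag(oa1) @ oa2 into an m-vector, then uv @ (...) into one value per unique row, and only then replicates to length n. A sketch checking the equivalence on the question's toy arrays (using the onehot helper above):
uv, im = unique_values, index_mapping
oa1, oa2 = other_array1.ravel(), other_array2.ravel()
lhs = (uv[im] * oa1) @ oa2
rhs = onehot(im, uv.shape[0]) @ (uv @ (np.diag(oa1) @ oa2))
assert np.allclose(lhs, rhs)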
Using (1) and (2) backwards on this, we obtain the optimized expression from the beginning of this post.
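Spelled out, the back-translation goes (with oa1 and oa2 raveled to 1D):
diag(oa1) @ oa2 == oa1 * oa2  # (1) applied to a single-column matrix
onehot(im) @ x == x[im]       # (2) in reverse
# giving (uv @ (oa1 * oa2))[im], i.e. the expression benchmarked above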
Based on your example, you can simply factor the index mapping through the computation to the very end: every step (the broadcast multiply and the dot product) acts row-wise, so duplicate input rows produce duplicate outputs, and replicating at the end gives the same result:
output2 = np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]
assert (output == output2).all()