
Broadcast views irregularly in numpy

Suppose I have a numpy array of size (n, m) where n is very large, but with a lot of duplication, i.e. rows 0:n1 are identical, rows n1:n2 are identical, etc. (with n2 % n1 != 0, i.e. not at regular intervals). Is there a way to store only one copy of the values for each group of duplicates while still having a view of the entire array?

example:

import numpy as np

unique_values = np.array([[1,1,1], [2,2,2], [3,3,3]])  # these are the values I want to store in memory
index_mapping = np.array([0,0,1,1,1,2,2])  # maps each row of the full array below to a row of the array above

unique_values_view = np.array([[1,1,1],[1,1,1],[2,2,2],[2,2,2],[2,2,2],[3,3,3],[3,3,3]])  # this is how I want the view to look for broadcasting reasons
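(For reference, fancy indexing can reconstruct this array from the two stored ones, but it produces a new array, i.e. a copy, rather than a true view:)

unique_values_view = unique_values[index_mapping]  # fancy indexing returns a copy, not a view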

I plan to multiply the array (view) by some other array of size (1, m), and then take the dot product of the result with an array of size (m, 1):

other_array1 = np.arange(unique_values.shape[1]).reshape(1,-1) # (1,m)
other_array2 = 2*np.ones((unique_values.shape[1],1)) # (m,1)
output = np.dot(unique_values_view * other_array1, other_array2).squeeze()

The output is a 1D array of length n.

asked Jun 01 '18 by M.T



2 Answers

Your expression admits two significant optimizations:

  • do the indexing last
  • multiply other_array1 with other_array2 first and then with unique_values

Let's apply these optimizations:

>>> output_pp = (unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]

# check for correctness
>>> (output == output_pp).all()
True

# and compare it to @Yakym Pirozhenko's approach
>>> from timeit import timeit
>>> print("yp:", timeit("np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]", globals=globals()))
yp: 3.9105667411349714
>>> print("pp:", timeit("(unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]", globals=globals()))
pp: 2.2684884609188884

These optimizations are easy to spot if we observe two things:

(1) if A is an m×n matrix and b is an n-vector, then

A * b == A @ diag(b)
A.T * b[:, None] == diag(b) @ A.T
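
A quick numeric check of (1), on small arrays chosen arbitrarily here:

>>> A = np.arange(6).reshape(2, 3)
>>> b = np.array([1., 2., 3.])
>>> np.allclose(A * b, A @ np.diag(b))
True
>>> np.allclose(A.T * b[:, None], np.diag(b) @ A.T)
True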

(2) if A is an m×n matrix and I is a k-vector of integers from range(m), then

A[I] == onehot(I) @ A

onehot can be defined as

def onehot(I, m, dtype=int):
    # (k, m) matrix with a single 1 per row: row i has a 1 in column I[i]
    out = np.zeros((I.size, m), dtype=dtype)
    out[np.arange(I.size), I] = 1
    return out
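
And a quick check of (2), again with small arbitrary arrays:

>>> A = np.arange(12).reshape(4, 3)
>>> I = np.array([2, 0, 2])
>>> (A[I] == onehot(I, 4) @ A).all()
True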

Using these facts and abbreviating unique_values, index_mapping, other_array1 and other_array2 as uv, im, oa1 and oa2, we can write

uv[im] * oa1 @ oa2 == onehot(im) @ uv @ diag(oa1) @ oa2

The above optimizations are now simply a matter of choosing the best order for these matrix multiplications, which is

onehot(im) @ (uv @ (diag(oa1) @ oa2))

Applying (1) and (2) in reverse to this expression, we obtain the optimized expression from the beginning of this post.
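
As a sanity check, the fully expanded matrix-product form gives the same result on the question's example arrays (passing the number of unique rows to onehot explicitly):

>>> full = onehot(index_mapping, unique_values.shape[0]) @ unique_values @ np.diag(other_array1.ravel()) @ other_array2
>>> (full.squeeze() == output_pp).all()
True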

answered by Paul Panzer


Based on your example, you can simply factor the index mapping through the computation to the very end. That way the expensive multiplication and dot product are performed only on the unique rows, and the cheap gather by index_mapping happens last:

output2 = np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]

assert (output == output2).all()
answered by hilberts_drinking_problem