Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory Efficient L2 norm using Python broadcasting

I am trying to implement a way to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, each point is a N dimensional vector (N=1024). The training dataset has around 10000 points and each point is also a 1024- dim vector. The goal is to find the L2-distance between each test point and all the sample points to find the closest sample (without using any python distance functions). Since the test array and training array have different sizes, I tried using broadcasting:

    import numpy as np
    dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2))

where test is an array of shape (500,1024) and train is an array of shape (10000,1024). I am getting a MemoryError. However, the same code works for smaller arrays. For example:

     test= np.array([[1,2],[3,4]])
     train=np.array([[1,0],[0,1],[1,1]])

Is there a more memory efficient way to do the above computation without loops? Based on the posts online, we can implement L2- norm using matrix multiplication sqrt(X * X-2*X * Y+Y * Y). So I tried the following:

    x2 = np.dot(test, test.T)
    y2 = np.dot(train,train.T)
    xy = 2* np.dot(test,train.T)

    dist = np.sqrt(x2 - xy + y2)

Since the matrices have different shapes, when I tried to broadcast, there is a dimension mismatch and I am not sure what is the right way to broadcast (dont have much experience with Python broadcasting). I would like to know what is the right way to implement the L2 distance computation as a matrix multiplication in Python, where the matrices have different shapes. The resultant distance matrix should have dist[i,j] = Euclidean distance between test point i and sample point j.

thanks

like image 223
user1462351 Avatar asked Sep 30 '15 02:09

user1462351


People also ask

What is broadcasting in Python with example?

Broadcasting provides a convenient way of taking the outer product (or any other outer operation) of two arrays. The following example shows an outer addition operation of two 1-d arrays: >>> a = np. array([0.0, 10.0, 20.0, 30.0]) >>> b = np.

Can Python broadcast multiple dimensions?

The arrays can be broadcast together iff they are compatible with all dimensions. After broadcasting, each array behaves as if it had shape equal to the element-wise maximum of shapes of the two input arrays.

Does NumPy have broadcasting?

NumPy can perform such operation by using the concept of broadcasting. In broadcasting, the smaller array is broadcast to the larger array to make their shapes compatible with each other.

When broadcasting an operation in NumPy which dimensions get broadcast first?

NumPy broadcasting typically matches dimensions from the last dimension, so usual broadcasting will not work (it would require the first array to have dimension (y,z) ). Background: I'm working with images, some of which are RGB (shape (h,w,3) ) and some of which are grayscale (shape (h,w) ).


2 Answers

Here is broadcasting with shapes of the intermediates made explicit:

m = x.shape[0] # x has shape (m, d)
n = y.shape[0] # y has shape (n, d)
x2 = np.sum(x**2, axis=1).reshape((m, 1))
y2 = np.sum(y**2, axis=1).reshape((1, n))
xy = x.dot(y.T) # shape is (m, n)
dists = np.sqrt(x2 + y2 - 2*xy) # shape is (m, n)

The documentation on broadcasting has some pretty good examples.

like image 112
spiky Avatar answered Sep 25 '22 06:09

spiky


I think what you are asking for already exists in scipy in the form of the cdist function.

from scipy.spatial.distance import cdist
res = cdist(test, train, metric='euclidean')
like image 24
John Greenall Avatar answered Sep 25 '22 06:09

John Greenall