I am trying to implement a way to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, each point is a N dimensional vector (N=1024). The training dataset has around 10000 points and each point is also a 1024- dim vector. The goal is to find the L2-distance between each test point and all the sample points to find the closest sample (without using any python distance functions). Since the test array and training array have different sizes, I tried using broadcasting: <pre class="prettyprint"><code> import numpy as np dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2)) </code></pre> where test is an array of shape (500,1024) and train is an array of shape (10000,1024). I am getting a MemoryError. However, the same code works for smaller arrays. For example: <pre class="prettyprint"><code> test= np.array([[1,2],[3,4]]) train=np.array([[1,0],[0,1],[1,1]]) </code></pre> Is there a more memory efficient way to do the above computation without loops? Based on the posts online, we can implement L2- norm using matrix multiplication sqrt(X * X-2*X * Y+Y * Y). So I tried the following: <pre class="prettyprint"><code> x2 = np.dot(test, test.T) y2 = np.dot(train,train.T) xy = 2* np.dot(test,train.T) dist = np.sqrt(x2 - xy + y2) </code></pre> Since the matrices have different shapes, when I tried to broadcast, there is a dimension mismatch and I am not sure what is the right way to broadcast (dont have much experience with Python broadcasting). I would like to know what is the right way to implement the L2 distance computation as a matrix multiplication in Python, where the matrices have different shapes. The resultant distance matrix should have dist[i,j] = Euclidean distance between test point i and sample point j. thanks

I think what you are asking for already exists in scipy in the form of the cdist function. <pre class="prettyprint"><code>from scipy.spatial.distance import cdist res = cdist(test, train, metric='euclidean') </code></pre>

Memory Efficient L2 norm using Python broadcasting

Tags:

python

numpy

euclidean-distance

array-broadcasting

I am trying to implement a way to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, each point is a N dimensional vector (N=1024). The training dataset has around 10000 points and each point is also a 1024- dim vector. The goal is to find the L2-distance between each test point and all the sample points to find the closest sample (without using any python distance functions). Since the test array and training array have different sizes, I tried using broadcasting:

Click to copy

    import numpy as np
    dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2))

where test is an array of shape (500,1024) and train is an array of shape (10000,1024). I am getting a MemoryError. However, the same code works for smaller arrays. For example:

Click to copy

     test= np.array([[1,2],[3,4]])
     train=np.array([[1,0],[0,1],[1,1]])

Is there a more memory efficient way to do the above computation without loops? Based on the posts online, we can implement L2- norm using matrix multiplication sqrt(X * X-2*X * Y+Y * Y). So I tried the following:

Click to copy

    x2 = np.dot(test, test.T)
    y2 = np.dot(train,train.T)
    xy = 2* np.dot(test,train.T)

    dist = np.sqrt(x2 - xy + y2)

Since the matrices have different shapes, when I tried to broadcast, there is a dimension mismatch and I am not sure what is the right way to broadcast (dont have much experience with Python broadcasting). I would like to know what is the right way to implement the L2 distance computation as a matrix multiplication in Python, where the matrices have different shapes. The resultant distance matrix should have dist[i,j] = Euclidean distance between test point i and sample point j.

thanks

223

asked Sep 30 '15 02:09

user1462351

2 Answers

Here is broadcasting with shapes of the intermediates made explicit:

Click to copy

m = x.shape[0] # x has shape (m, d)
n = y.shape[0] # y has shape (n, d)
x2 = np.sum(x**2, axis=1).reshape((m, 1))
y2 = np.sum(y**2, axis=1).reshape((1, n))
xy = x.dot(y.T) # shape is (m, n)
dists = np.sqrt(x2 + y2 - 2*xy) # shape is (m, n)

The documentation on broadcasting has some pretty good examples.

112

answered Sep 25 '22 06:09

spiky

I think what you are asking for already exists in scipy in the form of the cdist function.

Click to copy

from scipy.spatial.distance import cdist
res = cdist(test, train, metric='euclidean')

answered Sep 25 '22 06:09

John Greenall

Related questions
                            
                                NumPy - Faster way to implement threshold value ceiling
                            
                                Matplotlib drag overlapping points interactively
                            
                                Ipython is working in command prompt but not in browser
                            
                                Plotting communities with python igraph
                            
                                RuntimeWarnings with GPIO.setup and GPIO.cleanup not work with KeyboardInterrupt
                            
                                PYTHON IndexError: tuple index out of range
                            
                                Audio spectrum extraction from audio file by python
                            
                                Moving balls in Tkinter Canvas
                            
                                How do you scroll a GridLayout inside Kivy ScrollView?
                            
                                Django 1.7 google oauth2 token validation failure
                            
                                2D Gaussian Fit for intensities at certain coordinates in Python
                            
                                Fully disable python logging
                            
                                Numpy: find index of elements in one array that occur in another array
                            
                                Create variable name from two string in python
                            
                                Include with url variable in Django template
                            
                                list() takes at most 1 argument (3 given)
                            
                                How do I properly set DPI when saving a pillow image?
                            
                                Creating a matrix of arbitrary size where rows sum to 1?
                            
                                How to insert Billion of data to Redis efficiently?
                            
                                Python, Flask, Gunicorn Error: Unrecognized Arguments

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Memory Efficient L2 norm using Python broadcasting

Tags:

python

numpy

euclidean-distance

array-broadcasting

user1462351

People also ask

2 Answers

spiky

John Greenall

Recent Activity

Donate For Us