Memory Efficient Nearest Neighbour Algorithm

Question

I have 10,00,000 agents, each associated with (x,y) coordinates. I am trying to find agents close to each other (radius=1.5). I tried to implement this using PyTorch:

X = torch.DoubleTensor(1000000,2).uniform_(0,10000)
torch.cdist(X,X,p=2)

However, with this the session crashes. I am running this on google colab. The same happened when I tried constructing the graph using, radius_neighbors_graph of scikit-learn package. It would be of great help if someone suggested a memory efficient way to implement the same.

Mad Physicist · Accepted Answer

It's unlikely that you'll be able to compute a 1M*1M matrix in its entirety without thinking it through very carefully. You probably want something along the lines of scipy.spatial.KDTree. Once you've constructed a tree, you can pass the coordinates of an agent to the query method to get its neighbors within a certain radius. To get all the neighbors at once, you can come compute something like sparse_distance_matrix of the tree with itself at an appropriate threshold.

Alternatively, you can look into any number of efficient clustering algorithms.

user3856486 · Answer

I found three solutions, Solution 1

import torch
x = torch.randn(3000000, 2).cuda()
y = x

# Turn our Tensors into KeOps symbolic variables:
from pykeops.torch import LazyTensor
x_i = LazyTensor( x[:,None,:] ) 
y_j = LazyTensor( y[None,:,:] )  

# We can now perform large-scale computations, without memory overflows:
D_ij = ((x_i - y_j)**2).sum(dim=2)  
D_ij.argKmin(20,dim=1)

Solution 2

M = 3000000
import numpy as np
from pykeops.numpy import LazyTensor as LazyTensor_np
x = np.random.rand(M, 2)
y = x
x_i = LazyTensor_np(
    x[:, None, :]
)  # (M, 1, 2) KeOps LazyTensor, wrapped around the numpy array x
y_j = LazyTensor_np(
    y[None, :, :]
)  # (1, N, 2) KeOps LazyTensor, wrapped around the numpy array y

D_ij = ((x_i - y_j) ** 2).sum(-1)  # **Symbolic** (M, N) matrix of squared distances
s_i = D_ij.argKmin(20,dim=1).ravel()  # genuine (M,) array of integer indices

Solution 3

from sklearn.neighbors import NearestNeighbors
import numpy as np
M = 3000000
x = np.random.rand(M, 2)
nbrs = NearestNeighbors(n_neighbors=20, algorithm='ball_tree').fit(x)
distances, indices = nbrs.kneighbors(x)

Although the execution time of all the three solutions is the same, a minute, the memory requirements are approximately 2GB, 1GB and 1.3GB, respectively. It would be great to hear ideas to lower the execution time.

Memory Efficient Nearest Neighbour Algorithm

Tags:

python

numpy

pytorch

scikit-learn

user3856486

2 Answers

Mad Physicist

user3856486

Recent Activity

Donate For Us

Memory Efficient Nearest Neighbour Algorithm

Tags:

python

numpy

pytorch

scikit-learn

user3856486

2 Answers

Mad Physicist

user3856486

Related questions

Recent Activity

Donate For Us