Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory Efficient Nearest Neighbour Algorithm

I have 10,00,000 agents, each associated with (x,y) coordinates. I am trying to find agents close to each other (radius=1.5). I tried to implement this using PyTorch:

X = torch.DoubleTensor(1000000,2).uniform_(0,10000)
torch.cdist(X,X,p=2)

However, with this the session crashes. I am running this on google colab. The same happened when I tried constructing the graph using, radius_neighbors_graph of scikit-learn package. It would be of great help if someone suggested a memory efficient way to implement the same.

like image 540
user3856486 Avatar asked Apr 29 '26 02:04

user3856486


2 Answers

It's unlikely that you'll be able to compute a 1M*1M matrix in its entirety without thinking it through very carefully. You probably want something along the lines of scipy.spatial.KDTree. Once you've constructed a tree, you can pass the coordinates of an agent to the query method to get its neighbors within a certain radius. To get all the neighbors at once, you can come compute something like sparse_distance_matrix of the tree with itself at an appropriate threshold.

Alternatively, you can look into any number of efficient clustering algorithms.

like image 103
Mad Physicist Avatar answered May 01 '26 14:05

Mad Physicist


I found three solutions, Solution 1

import torch
x = torch.randn(3000000, 2).cuda()
y = x

# Turn our Tensors into KeOps symbolic variables:
from pykeops.torch import LazyTensor
x_i = LazyTensor( x[:,None,:] ) 
y_j = LazyTensor( y[None,:,:] )  

# We can now perform large-scale computations, without memory overflows:
D_ij = ((x_i - y_j)**2).sum(dim=2)  
D_ij.argKmin(20,dim=1)

Solution 2

M = 3000000
import numpy as np
from pykeops.numpy import LazyTensor as LazyTensor_np
x = np.random.rand(M, 2)
y = x
x_i = LazyTensor_np(
    x[:, None, :]
)  # (M, 1, 2) KeOps LazyTensor, wrapped around the numpy array x
y_j = LazyTensor_np(
    y[None, :, :]
)  # (1, N, 2) KeOps LazyTensor, wrapped around the numpy array y

D_ij = ((x_i - y_j) ** 2).sum(-1)  # **Symbolic** (M, N) matrix of squared distances
s_i = D_ij.argKmin(20,dim=1).ravel()  # genuine (M,) array of integer indices

Solution 3

from sklearn.neighbors import NearestNeighbors
import numpy as np
M = 3000000
x = np.random.rand(M, 2)
nbrs = NearestNeighbors(n_neighbors=20, algorithm='ball_tree').fit(x)
distances, indices = nbrs.kneighbors(x)

Although the execution time of all the three solutions is the same, a minute, the memory requirements are approximately 2GB, 1GB and 1.3GB, respectively. It would be great to hear ideas to lower the execution time.

like image 44
user3856486 Avatar answered May 01 '26 14:05

user3856486