I have 10,00,000 agents, each associated with (x,y) coordinates. I am trying to find agents close to each other (radius=1.5). I tried to implement this using PyTorch:
X = torch.DoubleTensor(1000000,2).uniform_(0,10000)
torch.cdist(X,X,p=2)
However, with this the session crashes. I am running this on google colab. The same happened when I tried constructing the graph using, radius_neighbors_graph of scikit-learn package. It would be of great help if someone suggested a memory efficient way to implement the same.
It's unlikely that you'll be able to compute a 1M*1M matrix in its entirety without thinking it through very carefully. You probably want something along the lines of scipy.spatial.KDTree. Once you've constructed a tree, you can pass the coordinates of an agent to the query method to get its neighbors within a certain radius. To get all the neighbors at once, you can come compute something like sparse_distance_matrix of the tree with itself at an appropriate threshold.
Alternatively, you can look into any number of efficient clustering algorithms.
I found three solutions, Solution 1
import torch
x = torch.randn(3000000, 2).cuda()
y = x
# Turn our Tensors into KeOps symbolic variables:
from pykeops.torch import LazyTensor
x_i = LazyTensor( x[:,None,:] )
y_j = LazyTensor( y[None,:,:] )
# We can now perform large-scale computations, without memory overflows:
D_ij = ((x_i - y_j)**2).sum(dim=2)
D_ij.argKmin(20,dim=1)
Solution 2
M = 3000000
import numpy as np
from pykeops.numpy import LazyTensor as LazyTensor_np
x = np.random.rand(M, 2)
y = x
x_i = LazyTensor_np(
x[:, None, :]
) # (M, 1, 2) KeOps LazyTensor, wrapped around the numpy array x
y_j = LazyTensor_np(
y[None, :, :]
) # (1, N, 2) KeOps LazyTensor, wrapped around the numpy array y
D_ij = ((x_i - y_j) ** 2).sum(-1) # **Symbolic** (M, N) matrix of squared distances
s_i = D_ij.argKmin(20,dim=1).ravel() # genuine (M,) array of integer indices
Solution 3
from sklearn.neighbors import NearestNeighbors
import numpy as np
M = 3000000
x = np.random.rand(M, 2)
nbrs = NearestNeighbors(n_neighbors=20, algorithm='ball_tree').fit(x)
distances, indices = nbrs.kneighbors(x)
Although the execution time of all the three solutions is the same, a minute, the memory requirements are approximately 2GB, 1GB and 1.3GB, respectively. It would be great to hear ideas to lower the execution time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With