I have to create a data structure to store distances from each point to every other point in a very large array of 2d-coordinates. It's easy to implement for small arrays, but beyond about 50,000 points I start running into memory issues -- not surprising, given that I'm creating an n x n matrix.
Here's a simple example which works fine:
import numpy as np
from scipy.spatial import distance
n = 2000
arr = np.random.rand(n,2)
d = distance.cdist(arr,arr)
cdist
is fast, but is inefficient in storage since the matrix is mirrored diagonally (e.g. d[i][j] == d[j][i]
). I can use np.triu(d)
to convert to upper triangular, but the resulting square matrix still takes the same memory. I also don't need distances beyond a certain cutoff, so that can be helpful. The next step is to convert to a sparse matrix to save memory:
from scipy import sparse
max_dist = 5
dist = np.array([[0,1,3,6], [1,0,8,7], [3,8,0,4], [6,7,4,0]])
print dist
array([[0, 1, 3, 6],
[1, 0, 8, 7],
[3, 8, 0, 4],
[6, 7, 4, 0]])
dist[dist>=max_dist] = 0
dist = np.triu(dist)
print dist
array([[0, 1, 3, 0],
[0, 0, 0, 0],
[0, 0, 0, 4],
[0, 0, 0, 0]])
sdist = sparse.lil_matrix(dist)
print sdist
(0, 1) 1
(2, 3) 4
(0, 2) 3
The problem is getting to that sparse matrix quickly for a very large dataset. To reiterate, making a square matrix with cdist is the fastest way I know of to calculate distances between points, but the intermediate square matrix runs out of memory. I could break it down into more manageable chunks of rows, but then that slows things down a lot. I feel like I'm missing some obvious easy way to go directly to a sparse matrix from cdist
.
Here is how to do it with a KDTree
:
>>> import numpy as np
>>> from scipy import sparse
>>> from scipy.spatial import cKDTree as KDTree
>>>
# mock data
>>> a = np.random.random((50000, 2))
>>>
# make tree
>>> A = KDTree(a)
>>>
# list all pairs within 0.05 of each other in 2-norm
# format: (i, j, v) - i, j are indices, v is distance
>>> D = A.sparse_distance_matrix(A, 0.05, p=2.0, output_type='ndarray')
>>>
# only keep upper triangle
>>> DU = D[D['i'] < D['j']]
>>>
# make sparse matrix
>>> result = sparse.coo_matrix((DU['v'], (DU['i'], DU['j'])), (50000, 50000))
>>> result
<50000x50000 sparse matrix of type '<class 'numpy.float64'>'
with 9412560 stored elements in COOrdinate format>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With