Fast fuse of close points in a numpy-2d (vectorized)

Question

I have a question similar to the question asked here: simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates. The closeness in cells is specified by the user (I am talking about euclidean distance).

In my case I have a lot of points (about 1-million). This method is working, but is time consuming as it uses a double for loop.

Is there a faster way to detect and fuse close points in a numpy 2d array?

To be complete I added an example:

points=array([[  382.49056159,   640.1731949 ],
   [  496.44669161,   655.8583119 ],
   [ 1255.64762859,   672.99699399],
   [ 1070.16520917,   688.33538171],
   [  318.89390168,   718.05989421],
   [  259.7106383 ,   822.2       ],
   [  141.52574427,    28.68594436],
   [ 1061.13573287,    28.7094536 ],
   [  820.57417943,    84.27702407],
   [  806.71416007,   108.50307828]])

A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.

enter image description here

In the output of the fuse function the last to points are fused. This will look like:

#output
array([[  382.49056159,   640.1731949 ],
   [  496.44669161,   655.8583119 ],
   [ 1255.64762859,   672.99699399],
   [ 1070.16520917,   688.33538171],
   [  318.89390168,   718.05989421],
   [  259.7106383 ,   822.2       ],
   [  141.52574427,    28.68594436],
   [ 1061.13573287,    28.7094536 ],
   [  813.64416975,    96.390051175]])

ali_m · Accepted Answer

If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.KDTree, then query it for pairs of points that are closer than some threshold:

import numpy as np
from scipy.spatial import KDTree

tree = KDTree(points)
rows_to_fuse = tree.query_pairs(r=30)    

print(repr(rows_to_fuse))
# {(8, 9)}

print(repr(points[list(rows_to_fuse)]))
# array([[ 820.57417943,   84.27702407],
#        [ 806.71416007,  108.50307828]])

The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.

Saullo G. P. Castro · Answer

You can use scipy's distance functions such as pdist in order to quickly find which points should be merged:

import numpy as np
from scipy.spatial.distance import pdist, squareform

d = squareform(pdist(a))
d = np.ma.array(d, mask=np.isclose(d, 0))
a[d.min(axis=1) < 30]
#array([[ 820.57417943,   84.27702407],
#       [ 806.71416007,  108.50307828]])

NOTE

For large samples this method can cause memory errors since it is storing a full matrix containing the relative distances.

Fast fuse of close points in a numpy-2d (vectorized)

Tags:

python

arrays

numpy

scipy

distance

Wilmar van Ommeren

2 Answers

ali_m

NOTE

Saullo G. P. Castro

Recent Activity

Donate For Us

Fast fuse of close points in a numpy-2d (vectorized)

Tags:

python

arrays

numpy

scipy

distance

Wilmar van Ommeren

2 Answers

ali_m

NOTE

Saullo G. P. Castro

Related questions

Recent Activity

Donate For Us