Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently determine if a set of points contains two that are close

I need to determine if a set of points (each given by a tuple of floats, each of which is in [0, 1]) contains two that are within some threshold, say 0.01, of each other. I should also mention that in the version of the problem that I am interested in, these "points" are given by tuples of length ~30, that is they are points in [0, 1]^30.

I can test if any two are within this threshold using something like:

def is_near(p1, p2):
    return sqrt(sum((x1 - x2)**2 for x1, x2 in zip(p1, p2))) < 0.01  # Threshold.

Using this I can just check every pair using something like:

def contains_near(points):
     from itertools import combinations
     return any(is_near(p1, p2) for p1, p2 in combinations(points, r=2))

However this is quadratic in the length of the list, which is too slow for the long list of points that I have.

Is there an n log n way of solving this?

I tried doing things like snapping these points to a grid so I could use a dictionary / hash map to store them:

def contains_near_hash(points):
     seen = dict()
     for point in points:
         # The rescaling constant should be 1 / threshold.
         grid_point = tuple([round(x * 100, 0) for x in point])  
         if grid_point in seen:
             for other in seen[grid_point]:
                 if is_near(point, other):
                      return True
             seen[grid_point].append(point)
         else:
             seen[grid_point] = [point]
    return False

However this doesn't work when

points = [(0.1149999,), (0.1150001,)]

As these round to two different grid points. I also tried a version in which the point was appended to all neighbouring grid points however as the examples that I want to do have ~30 coordinates, each grid point has 2^30 neighbours which makes this completely impractical.

like image 246
Mark Bell Avatar asked Apr 19 '15 11:04

Mark Bell


1 Answers

A pair of points can only be 'near' each other if their distance in at least one dimension is less than the threshold. This can be exploited to reduce the number of candidate pairs by filtering one dimension after the other.

I suggest:
- sort the points in one dimension (say: x)
- find all points which are close enough to the next point in the sorted list and put their index into a set of candidates
- do not use sqrt() but the quadratic distance (x1 - x2)**2 or even abs(x1 - x2) for efficiency
- do that for the second dimension as well
- determine the intersect of both sets, these are points near each other

This way, you avoid costly is_near() calls, operate on way smaller sets, only work with unique points, and set lookups are very efficient.

This scheme can easily be expanded to include more than 2 dimensions.

like image 157
user1016274 Avatar answered Oct 08 '22 17:10

user1016274