Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Choosing subset of farthest points in given set of points

Imagine you are given set S of n points in 3 dimensions. Distance between any 2 points is simple Euclidean distance. You want to chose subset Q of k points from this set such that they are farthest from each other. In other words there is no other subset Q’ of k points exists such that min of all pair wise distances in Q is less than that in Q’.

If n is approximately 16 million and k is about 300, how do we efficiently do this?

My guess is that this NP-hard so may be we just want to focus on approximation. One idea I can think of is using Multidimensional scaling to sort these points in a line and then use version of binary search to get points that are furthest apart on this line.

like image 407
Shital Shah Avatar asked Feb 22 '18 10:02

Shital Shah


2 Answers

This is known as the discrete p-dispersion (maxmin) problem.

The optimality bound is proved in White (1991) and Ravi et al. (1994) give a factor-2 approximation for the problem with the latter proving this heuristic is the best possible (unless P=NP).

Factor-2 Approximation

The factor-2 approximation is as follows:

Let V be the set of nodes/objects
Let i and j be two nodes at maximum distance
Let p be the number of objects to choose
p = set([i,j])
while size(P)<p:
  Find a node v in V-P such that min_{v' in P} dist(v,v') is maximum
  \That is: find the node with the greatest minimum distance to the set P
  P = P.union(v)
Output P

You could implement this in Python like so:

#!/usr/bin/env python3

import numpy as np

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2         #Make the matrix symmetric

print("Finding initial edge...")
maxdist  = 0
bestpair = ()
for i in range(N):
  for j in range(i+1,N):
    if d[i,j]>maxdist:
      maxdist = d[i,j]
      bestpair = (i,j)

P = set()
P.add(bestpair[0])
P.add(bestpair[1])

print("Finding optimal set...")
while len(P)<p:
  print("P size = {0}".format(len(P)))
  maxdist = 0
  vbest = None
  for v in range(N):
    if v in P:
      continue
    for vprime in P:
      if d[v,vprime]>maxdist:
        maxdist = d[v,vprime]
        vbest   = v
  P.add(vbest)

print(P)

Exact Solution

You could also model this as an MIP. For p=50, n=400 after 6000s, the optimality gap was still 568%. The approximation algorithm took 0.47s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:

#!/usr/bin/env python
import numpy as np
import gurobipy as grb

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2             #Make the matrix symmetric

m = grb.Model(name="MIP Model")

used  = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]

objective = grb.quicksum( d[i,j]*used[i]*used[j] for i in range(0,N) for j in range(i+1,N) )

m.addConstr(
  lhs=grb.quicksum(used),
  sense=grb.GRB.EQUAL,
  rhs=p
)

# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)

# m.Params.TimeLimit = 3*60

# solving with Glpk
ret = m.optimize()

Scaling

Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).

The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.

import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

p = 300
N = 16000000

# Find a convex hull in O(N log N)
points = np.random.rand(N, 3)   # N random points in 3-D

# Returned 420 points in testing
hull = ConvexHull(points)

# Extract the points forming the hull
hullpoints = points[hull.vertices,:]

# Naive way of finding the best pair in O(H^2) time if H is number of points on
# hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')

# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)

P = np.array([hullpoints[bestpair[0]],hullpoints[bestpair[1]]])

# Now we have a problem
print("Finding optimal set...")
while len(P)<p:
  print("P size = {0}".format(len(P)))
  distance_to_P        = cdist(points, P)
  minimum_to_each_of_P = np.min(distance_to_P, axis=1)
  best_new_point_idx   = np.argmax(minimum_to_each_of_P)
  best_new_point = np.expand_dims(points[best_new_point_idx,:],0)
  P = np.append(P,best_new_point,axis=0)

print(P)
like image 89
Richard Avatar answered Oct 06 '22 09:10

Richard


I am also pretty sure that the problem is NP-Hard, the most similar problem I found is the k-Center Problem. If runtime is more important than correctness a greedy algorithm is probably your best choice:

Q ={}
while |Q| < k
    Q += p from S where mindist(p, Q) is maximal

Side note: In similar problems e.g., the set-cover problem it can be shown that the solution from the greedy algorithm is at least 63% as good as the optimal solution.

In order to speed things up I see 3 possibilities:

  1. Index your dataset in an R-Tree first, then perform a greedy search. Construction of the R-Tree is O(n log n), but though being developed for nearest neighbor search, it can also help you finding the furthest point to a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.

  2. Sample a subset from your 16 million points and perform the greedy algorithm on the subset. You are approximate anyway so you might be able to spare a little more accuracy. You can also combine this with the 1. algorithm.

  3. Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (lets call this set Q'). Then in each step you switch the point p_ from Q' that has the minimum distance to another one in Q' with a random point from S. If the resulting set Q'' is better proceed with Q'', otherwise repeat with Q'. In order not to get stuck you might want to choose another point from Q' than p_ if you could not find an adequate replacement for a couple of iterations.

like image 37
SaiBot Avatar answered Oct 06 '22 07:10

SaiBot