I have a set of points in a 3D space, from which I need to find the Pareto frontier. Speed of execution is very important here, and time increases very fast as I add points to test.
The set of points looks like this:
[[0.3296170319979843, 0.0, 0.44472108843537406], [0.3296170319979843,0.0, 0.44472108843537406], [0.32920760896951373, 0.0, 0.4440408163265306], [0.32920760896951373, 0.0, 0.4440408163265306], [0.33815192743764166, 0.0, 0.44356462585034007]]
Right now, I'm using this algorithm:
def dominates(row, candidateRow): return sum([row[x] >= candidateRow[x] for x in range(len(row))]) == len(row) def simple_cull(inputPoints, dominates): paretoPoints = set() candidateRowNr = 0 dominatedPoints = set() while True: candidateRow = inputPoints[candidateRowNr] inputPoints.remove(candidateRow) rowNr = 0 nonDominated = True while len(inputPoints) != 0 and rowNr < len(inputPoints): row = inputPoints[rowNr] if dominates(candidateRow, row): # If it is worse on all features remove the row from the array inputPoints.remove(row) dominatedPoints.add(tuple(row)) elif dominates(row, candidateRow): nonDominated = False dominatedPoints.add(tuple(candidateRow)) rowNr += 1 else: rowNr += 1 if nonDominated: # add the non-dominated point to the Pareto frontier paretoPoints.add(tuple(candidateRow)) if len(inputPoints) == 0: break return paretoPoints, dominatedPoints
Found here: http://code.activestate.com/recipes/578287-multidimensional-pareto-front/
What's the fastest way to find the set of non-dominated solutions in an ensemble of solutions? Or, in short, can Python do better than this algorithm?
goal = [minfn1,minfn2]; To calculate the Pareto front, take weight vectors [ a , 1 – a ] for a from 0 through 1. Solve the goal attainment problem, setting the weights to the various values. You can see the tradeoff between the two objective functions.
call a set S an ε-approximation of the Pareto-front P, if the directed Hausdorff distance between S and P is at most ε. They observe that an ε-approximation of any Pareto front P in d dimensions can be found using (1/ε)d queries.
In brief, Pareto optimal solution is defined as a set of 'non-inferior' solutions in the objective space defining a boundary beyond which none of the objectives can be improved without sacrificing at least one of the other objectives [17].
If you're worried about actual speed, you definitely want to use numpy (as the clever algorithmic tweaks probably have way less effect than the gains to be had from using array operations). Here are three solutions that all compute the same function. The is_pareto_efficient_dumb
solution is slower in most situations but becomes faster as the number of costs increases, the is_pareto_efficient_simple
solution is much more efficient than the dumb solution for many points, and the final is_pareto_efficient
function is less readable but the fastest (so all are Pareto Efficient!).
import numpy as np # Very slow for many datapoints. Fastest for many costs, most readable def is_pareto_efficient_dumb(costs): """ Find the pareto-efficient points :param costs: An (n_points, n_costs) array :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient """ is_efficient = np.ones(costs.shape[0], dtype = bool) for i, c in enumerate(costs): is_efficient[i] = np.all(np.any(costs[:i]>c, axis=1)) and np.all(np.any(costs[i+1:]>c, axis=1)) return is_efficient # Fairly fast for many datapoints, less fast for many costs, somewhat readable def is_pareto_efficient_simple(costs): """ Find the pareto-efficient points :param costs: An (n_points, n_costs) array :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient """ is_efficient = np.ones(costs.shape[0], dtype = bool) for i, c in enumerate(costs): if is_efficient[i]: is_efficient[is_efficient] = np.any(costs[is_efficient]<c, axis=1) # Keep any point with a lower cost is_efficient[i] = True # And keep self return is_efficient # Faster than is_pareto_efficient_simple, but less readable. def is_pareto_efficient(costs, return_mask = True): """ Find the pareto-efficient points :param costs: An (n_points, n_costs) array :param return_mask: True to return a mask :return: An array of indices of pareto-efficient points. If return_mask is True, this will be an (n_points, ) boolean array Otherwise it will be a (n_efficient_points, ) integer array of indices. """ is_efficient = np.arange(costs.shape[0]) n_points = costs.shape[0] next_point_index = 0 # Next index in the is_efficient array to search for while next_point_index<len(costs): nondominated_point_mask = np.any(costs<costs[next_point_index], axis=1) nondominated_point_mask[next_point_index] = True is_efficient = is_efficient[nondominated_point_mask] # Remove dominated points costs = costs[nondominated_point_mask] next_point_index = np.sum(nondominated_point_mask[:next_point_index])+1 if return_mask: is_efficient_mask = np.zeros(n_points, dtype = bool) is_efficient_mask[is_efficient] = True return is_efficient_mask else: return is_efficient
Profiling tests (using points drawn from a Normal distribution):
With 10,000 samples, 2 costs:
is_pareto_efficient_dumb: Elapsed time is 1.586s is_pareto_efficient_simple: Elapsed time is 0.009653s is_pareto_efficient: Elapsed time is 0.005479s
With 1,000,000 samples, 2 costs:
is_pareto_efficient_dumb: Really, really, slow is_pareto_efficient_simple: Elapsed time is 1.174s is_pareto_efficient: Elapsed time is 0.4033s
With 10,000 samples, 15 costs:
is_pareto_efficient_dumb: Elapsed time is 4.019s is_pareto_efficient_simple: Elapsed time is 6.466s is_pareto_efficient: Elapsed time is 6.41s
Note that if efficiency is a concern you can gain maybe a further 2x or so speedup by reordering your data beforehand, see here.
Here is another simple implementation that is pretty fast for modest dimensions. Input points are assumed to be unique.
def keep_efficient(pts): 'returns Pareto efficient row subset of pts' # sort points by decreasing sum of coordinates pts = pts[pts.sum(1).argsort()[::-1]] # initialize a boolean mask for undominated points # to avoid creating copies each iteration undominated = np.ones(pts.shape[0], dtype=bool) for i in range(pts.shape[0]): # process each point in turn n = pts.shape[0] if i >= n: break # find all points not dominated by i # since points are sorted by coordinate sum # i cannot dominate any points in 1,...,i-1 undominated[i+1:n] = (pts[i+1:] >= pts[i]).any(1) # keep points undominated so far pts = pts[undominated[:n]] return pts
We start by sorting points according to the sum of coordinates. This is useful because
x
has a larger coordinate sum than point y
, then y
cannot dominate x
.Here are some benchmarks relative to Peter's answer, using np.random.randn
.
N=10000 d=2 keep_efficient 1.31 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) is_pareto_efficient 6.51 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) N=10000 d=3 keep_efficient 2.3 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 16.4 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) N=10000 d=4 keep_efficient 4.37 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 21.1 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) N=10000 d=5 keep_efficient 15.1 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 110 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) N=10000 d=6 keep_efficient 40.1 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) is_pareto_efficient 279 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) N=10000 d=15 keep_efficient 3.92 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) is_pareto_efficient 5.88 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I ended up looking at this problem recently and found a useful heuristic that works well if there are many points distributed independently and dimensions are few.
The idea is to compute the convex hull of points. With few dimensions and independently distributed points, the number of vertices of the convex hull will be small. Intuitively, we can expect some vertices of the convex hull to dominate many of the original points. Moreover, if a point in a convex hull is not dominated by any other point in the convex hull, then it is also not dominated by any point in the original set.
This gives a simple iterative algorithm. We repeatedly
I add a few benchmarks for dimension 3. It seems that for some distribution of points this approach yields better asymptotics.
import numpy as np from scipy import spatial from functools import reduce # test points pts = np.random.rand(10_000_000, 3) def filter_(pts, pt): """ Get all points in pts that are not Pareto dominated by the point pt """ weakly_worse = (pts <= pt).all(axis=-1) strictly_worse = (pts < pt).any(axis=-1) return pts[~(weakly_worse & strictly_worse)] def get_pareto_undominated_by(pts1, pts2=None): """ Return all points in pts1 that are not Pareto dominated by any points in pts2 """ if pts2 is None: pts2 = pts1 return reduce(filter_, pts2, pts1) def get_pareto_frontier(pts): """ Iteratively filter points based on the convex hull heuristic """ pareto_groups = [] # loop while there are points remaining while pts.shape[0]: # brute force if there are few points: if pts.shape[0] < 10: pareto_groups.append(get_pareto_undominated_by(pts)) break # compute vertices of the convex hull hull_vertices = spatial.ConvexHull(pts).vertices # get corresponding points hull_pts = pts[hull_vertices] # get points in pts that are not convex hull vertices nonhull_mask = np.ones(pts.shape[0], dtype=bool) nonhull_mask[hull_vertices] = False pts = pts[nonhull_mask] # get points in the convex hull that are on the Pareto frontier pareto = get_pareto_undominated_by(hull_pts) pareto_groups.append(pareto) # filter remaining points to keep those not dominated by # Pareto points of the convex hull pts = get_pareto_undominated_by(pts, pareto) return np.vstack(pareto_groups) # -------------------------------------------------------------------------------- # previous solutions # -------------------------------------------------------------------------------- def is_pareto_efficient_dumb(costs): """ :param costs: An (n_points, n_costs) array :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient """ is_efficient = np.ones(costs.shape[0], dtype = bool) for i, c in enumerate(costs): is_efficient[i] = np.all(np.any(costs>=c, axis=1)) return is_efficient def is_pareto_efficient(costs): """ :param costs: An (n_points, n_costs) array :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient """ is_efficient = np.ones(costs.shape[0], dtype = bool) for i, c in enumerate(costs): if is_efficient[i]: is_efficient[is_efficient] = np.any(costs[is_efficient]<=c, axis=1) # Remove dominated points return is_efficient def dominates(row, rowCandidate): return all(r >= rc for r, rc in zip(row, rowCandidate)) def cull(pts, dominates): dominated = [] cleared = [] remaining = pts while remaining: candidate = remaining[0] new_remaining = [] for other in remaining[1:]: [new_remaining, dominated][dominates(candidate, other)].append(other) if not any(dominates(other, candidate) for other in new_remaining): cleared.append(candidate) else: dominated.append(candidate) remaining = new_remaining return cleared, dominated # -------------------------------------------------------------------------------- # benchmarking # -------------------------------------------------------------------------------- # to accomodate the original non-numpy solution pts_list = [list(pt) for pt in pts] import timeit # print('Old non-numpy solution:s\t{}'.format( # timeit.timeit('cull(pts_list, dominates)', number=3, globals=globals()))) print('Numpy solution:\t{}'.format( timeit.timeit('is_pareto_efficient(pts)', number=3, globals=globals()))) print('Convex hull heuristic:\t{}'.format( timeit.timeit('get_pareto_frontier(pts)', number=3, globals=globals())))
# >>= python temp.py # 1,000 points # Old non-numpy solution: 0.0316428339574486 # Numpy solution: 0.005961259012110531 # Convex hull heuristic: 0.012369581032544374 # >>= python temp.py # 1,000,000 points # Old non-numpy solution: 70.67529802105855 # Numpy solution: 5.398462114972062 # Convex hull heuristic: 1.5286884519737214 # >>= python temp.py # 10,000,000 points # Numpy solution: 98.03680767398328 # Convex hull heuristic: 10.203076395904645
I took a shot at rewriting the same algorithm with a couple of tweaks. I think most of your problems come from inputPoints.remove(row)
. This requires searching through the list of points; removing by index would be much more efficient. I also modified the dominates
function to avoid some redundant comparisons. This could be handy in a higher dimension.
def dominates(row, rowCandidate): return all(r >= rc for r, rc in zip(row, rowCandidate)) def cull(pts, dominates): dominated = [] cleared = [] remaining = pts while remaining: candidate = remaining[0] new_remaining = [] for other in remaining[1:]: [new_remaining, dominated][dominates(candidate, other)].append(other) if not any(dominates(other, candidate) for other in new_remaining): cleared.append(candidate) else: dominated.append(candidate) remaining = new_remaining return cleared, dominated
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With