Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast calculation of Pareto front in Python

Tags:

python

numpy

I have a set of points in a 3D space, from which I need to find the Pareto frontier. Speed of execution is very important here, and time increases very fast as I add points to test.

The set of points looks like this:

[[0.3296170319979843, 0.0, 0.44472108843537406], [0.3296170319979843,0.0, 0.44472108843537406], [0.32920760896951373, 0.0, 0.4440408163265306], [0.32920760896951373, 0.0, 0.4440408163265306], [0.33815192743764166, 0.0, 0.44356462585034007]] 

Right now, I'm using this algorithm:

def dominates(row, candidateRow):     return sum([row[x] >= candidateRow[x] for x in range(len(row))]) == len(row)   def simple_cull(inputPoints, dominates):     paretoPoints = set()     candidateRowNr = 0     dominatedPoints = set()     while True:         candidateRow = inputPoints[candidateRowNr]         inputPoints.remove(candidateRow)         rowNr = 0         nonDominated = True         while len(inputPoints) != 0 and rowNr < len(inputPoints):             row = inputPoints[rowNr]             if dominates(candidateRow, row):                 # If it is worse on all features remove the row from the array                 inputPoints.remove(row)                 dominatedPoints.add(tuple(row))             elif dominates(row, candidateRow):                 nonDominated = False                 dominatedPoints.add(tuple(candidateRow))                 rowNr += 1             else:                 rowNr += 1          if nonDominated:             # add the non-dominated point to the Pareto frontier             paretoPoints.add(tuple(candidateRow))          if len(inputPoints) == 0:             break     return paretoPoints, dominatedPoints 

Found here: http://code.activestate.com/recipes/578287-multidimensional-pareto-front/

What's the fastest way to find the set of non-dominated solutions in an ensemble of solutions? Or, in short, can Python do better than this algorithm?

like image 926
Lucien S. Avatar asked Sep 25 '15 23:09

Lucien S.


People also ask

How is Pareto front calculated?

goal = [minfn1,minfn2]; To calculate the Pareto front, take weight vectors [ a , 1 – a ] for a from 0 through 1. Solve the goal attainment problem, setting the weights to the various values. You can see the tradeoff between the two objective functions.

What is Pareto front approximation?

call a set S an ε-approximation of the Pareto-front P, if the directed Hausdorff distance between S and P is at most ε. They observe that an ε-approximation of any Pareto front P in d dimensions can be found using (1/ε)d queries.

What is Pareto optimal solution?

In brief, Pareto optimal solution is defined as a set of 'non-inferior' solutions in the objective space defining a boundary beyond which none of the objectives can be improved without sacrificing at least one of the other objectives [17].


2 Answers

If you're worried about actual speed, you definitely want to use numpy (as the clever algorithmic tweaks probably have way less effect than the gains to be had from using array operations). Here are three solutions that all compute the same function. The is_pareto_efficient_dumb solution is slower in most situations but becomes faster as the number of costs increases, the is_pareto_efficient_simple solution is much more efficient than the dumb solution for many points, and the final is_pareto_efficient function is less readable but the fastest (so all are Pareto Efficient!).

import numpy as np   # Very slow for many datapoints.  Fastest for many costs, most readable def is_pareto_efficient_dumb(costs):     """     Find the pareto-efficient points     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         is_efficient[i] = np.all(np.any(costs[:i]>c, axis=1)) and np.all(np.any(costs[i+1:]>c, axis=1))     return is_efficient   # Fairly fast for many datapoints, less fast for many costs, somewhat readable def is_pareto_efficient_simple(costs):     """     Find the pareto-efficient points     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         if is_efficient[i]:             is_efficient[is_efficient] = np.any(costs[is_efficient]<c, axis=1)  # Keep any point with a lower cost             is_efficient[i] = True  # And keep self     return is_efficient   # Faster than is_pareto_efficient_simple, but less readable. def is_pareto_efficient(costs, return_mask = True):     """     Find the pareto-efficient points     :param costs: An (n_points, n_costs) array     :param return_mask: True to return a mask     :return: An array of indices of pareto-efficient points.         If return_mask is True, this will be an (n_points, ) boolean array         Otherwise it will be a (n_efficient_points, ) integer array of indices.     """     is_efficient = np.arange(costs.shape[0])     n_points = costs.shape[0]     next_point_index = 0  # Next index in the is_efficient array to search for     while next_point_index<len(costs):         nondominated_point_mask = np.any(costs<costs[next_point_index], axis=1)         nondominated_point_mask[next_point_index] = True         is_efficient = is_efficient[nondominated_point_mask]  # Remove dominated points         costs = costs[nondominated_point_mask]         next_point_index = np.sum(nondominated_point_mask[:next_point_index])+1     if return_mask:         is_efficient_mask = np.zeros(n_points, dtype = bool)         is_efficient_mask[is_efficient] = True         return is_efficient_mask     else:         return is_efficient 

Profiling tests (using points drawn from a Normal distribution):

With 10,000 samples, 2 costs:

is_pareto_efficient_dumb: Elapsed time is 1.586s is_pareto_efficient_simple: Elapsed time is 0.009653s is_pareto_efficient: Elapsed time is 0.005479s 

With 1,000,000 samples, 2 costs:

is_pareto_efficient_dumb: Really, really, slow is_pareto_efficient_simple: Elapsed time is 1.174s is_pareto_efficient: Elapsed time is 0.4033s 

With 10,000 samples, 15 costs:

is_pareto_efficient_dumb: Elapsed time is 4.019s is_pareto_efficient_simple: Elapsed time is 6.466s is_pareto_efficient: Elapsed time is 6.41s 

Note that if efficiency is a concern you can gain maybe a further 2x or so speedup by reordering your data beforehand, see here.

like image 130
Peter Avatar answered Oct 28 '22 02:10

Peter


Updated Aug 2019

Here is another simple implementation that is pretty fast for modest dimensions. Input points are assumed to be unique.

def keep_efficient(pts):     'returns Pareto efficient row subset of pts'     # sort points by decreasing sum of coordinates     pts = pts[pts.sum(1).argsort()[::-1]]     # initialize a boolean mask for undominated points     # to avoid creating copies each iteration     undominated = np.ones(pts.shape[0], dtype=bool)     for i in range(pts.shape[0]):         # process each point in turn         n = pts.shape[0]         if i >= n:             break         # find all points not dominated by i         # since points are sorted by coordinate sum         # i cannot dominate any points in 1,...,i-1         undominated[i+1:n] = (pts[i+1:] >= pts[i]).any(1)          # keep points undominated so far         pts = pts[undominated[:n]]     return pts 

We start by sorting points according to the sum of coordinates. This is useful because

  • For many distributions of data, a point with a largest coordinate sum will dominate a large number of points.
  • If point x has a larger coordinate sum than point y, then y cannot dominate x.

Here are some benchmarks relative to Peter's answer, using np.random.randn.

N=10000 d=2  keep_efficient 1.31 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) is_pareto_efficient 6.51 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   N=10000 d=3  keep_efficient 2.3 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 16.4 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   N=10000 d=4  keep_efficient 4.37 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 21.1 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   N=10000 d=5  keep_efficient 15.1 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 110 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   N=10000 d=6  keep_efficient 40.1 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) is_pareto_efficient 279 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   N=10000 d=15  keep_efficient 3.92 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) is_pareto_efficient 5.88 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Convex Hull Heuristic

I ended up looking at this problem recently and found a useful heuristic that works well if there are many points distributed independently and dimensions are few.

The idea is to compute the convex hull of points. With few dimensions and independently distributed points, the number of vertices of the convex hull will be small. Intuitively, we can expect some vertices of the convex hull to dominate many of the original points. Moreover, if a point in a convex hull is not dominated by any other point in the convex hull, then it is also not dominated by any point in the original set.

This gives a simple iterative algorithm. We repeatedly

  1. Compute the convex hull.
  2. Save Pareto undominated points from the convex hull.
  3. Filter the points to remove those dominated by elements of the convex hull.

I add a few benchmarks for dimension 3. It seems that for some distribution of points this approach yields better asymptotics.

import numpy as np from scipy import spatial from functools import reduce  # test points pts = np.random.rand(10_000_000, 3)   def filter_(pts, pt):     """     Get all points in pts that are not Pareto dominated by the point pt     """     weakly_worse   = (pts <= pt).all(axis=-1)     strictly_worse = (pts < pt).any(axis=-1)     return pts[~(weakly_worse & strictly_worse)]   def get_pareto_undominated_by(pts1, pts2=None):     """     Return all points in pts1 that are not Pareto dominated     by any points in pts2     """     if pts2 is None:         pts2 = pts1     return reduce(filter_, pts2, pts1)   def get_pareto_frontier(pts):     """     Iteratively filter points based on the convex hull heuristic     """     pareto_groups = []      # loop while there are points remaining     while pts.shape[0]:         # brute force if there are few points:         if pts.shape[0] < 10:             pareto_groups.append(get_pareto_undominated_by(pts))             break          # compute vertices of the convex hull         hull_vertices = spatial.ConvexHull(pts).vertices          # get corresponding points         hull_pts = pts[hull_vertices]          # get points in pts that are not convex hull vertices         nonhull_mask = np.ones(pts.shape[0], dtype=bool)         nonhull_mask[hull_vertices] = False         pts = pts[nonhull_mask]          # get points in the convex hull that are on the Pareto frontier         pareto   = get_pareto_undominated_by(hull_pts)         pareto_groups.append(pareto)          # filter remaining points to keep those not dominated by         # Pareto points of the convex hull         pts = get_pareto_undominated_by(pts, pareto)      return np.vstack(pareto_groups)  # -------------------------------------------------------------------------------- # previous solutions # --------------------------------------------------------------------------------  def is_pareto_efficient_dumb(costs):     """     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         is_efficient[i] = np.all(np.any(costs>=c, axis=1))     return is_efficient   def is_pareto_efficient(costs):     """     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         if is_efficient[i]:             is_efficient[is_efficient] = np.any(costs[is_efficient]<=c, axis=1)  # Remove dominated points     return is_efficient   def dominates(row, rowCandidate):     return all(r >= rc for r, rc in zip(row, rowCandidate))   def cull(pts, dominates):     dominated = []     cleared = []     remaining = pts     while remaining:         candidate = remaining[0]         new_remaining = []         for other in remaining[1:]:             [new_remaining, dominated][dominates(candidate, other)].append(other)         if not any(dominates(other, candidate) for other in new_remaining):             cleared.append(candidate)         else:             dominated.append(candidate)         remaining = new_remaining     return cleared, dominated  # -------------------------------------------------------------------------------- # benchmarking # --------------------------------------------------------------------------------  # to accomodate the original non-numpy solution pts_list = [list(pt) for pt in pts]  import timeit  # print('Old non-numpy solution:s\t{}'.format(     # timeit.timeit('cull(pts_list, dominates)', number=3, globals=globals())))  print('Numpy solution:\t{}'.format(     timeit.timeit('is_pareto_efficient(pts)', number=3, globals=globals())))  print('Convex hull heuristic:\t{}'.format(     timeit.timeit('get_pareto_frontier(pts)', number=3, globals=globals()))) 

Results

# >>= python temp.py # 1,000 points # Old non-numpy solution:      0.0316428339574486 # Numpy solution:              0.005961259012110531 # Convex hull heuristic:       0.012369581032544374 # >>= python temp.py # 1,000,000 points # Old non-numpy solution:      70.67529802105855 # Numpy solution:              5.398462114972062 # Convex hull heuristic:       1.5286884519737214 # >>= python temp.py # 10,000,000 points # Numpy solution:              98.03680767398328 # Convex hull heuristic:       10.203076395904645 

Original Post

I took a shot at rewriting the same algorithm with a couple of tweaks. I think most of your problems come from inputPoints.remove(row). This requires searching through the list of points; removing by index would be much more efficient. I also modified the dominates function to avoid some redundant comparisons. This could be handy in a higher dimension.

def dominates(row, rowCandidate):     return all(r >= rc for r, rc in zip(row, rowCandidate))  def cull(pts, dominates):     dominated = []     cleared = []     remaining = pts     while remaining:         candidate = remaining[0]         new_remaining = []         for other in remaining[1:]:             [new_remaining, dominated][dominates(candidate, other)].append(other)         if not any(dominates(other, candidate) for other in new_remaining):             cleared.append(candidate)         else:             dominated.append(candidate)         remaining = new_remaining     return cleared, dominated 
like image 26
hilberts_drinking_problem Avatar answered Oct 28 '22 02:10

hilberts_drinking_problem