Fast calculation of Pareto front in Python

Tags:

numpy

I have a set of points in a 3D space, from which I need to find the Pareto frontier. Speed of execution is very important here, and time increases very fast as I add points to test.

The set of points looks like this:

[[0.3296170319979843, 0.0, 0.44472108843537406], [0.3296170319979843,0.0, 0.44472108843537406], [0.32920760896951373, 0.0, 0.4440408163265306], [0.32920760896951373, 0.0, 0.4440408163265306], [0.33815192743764166, 0.0, 0.44356462585034007]]

Right now, I'm using this algorithm:

def dominates(row, candidateRow):     return sum([row[x] >= candidateRow[x] for x in range(len(row))]) == len(row)   def simple_cull(inputPoints, dominates):     paretoPoints = set()     candidateRowNr = 0     dominatedPoints = set()     while True:         candidateRow = inputPoints[candidateRowNr]         inputPoints.remove(candidateRow)         rowNr = 0         nonDominated = True         while len(inputPoints) != 0 and rowNr < len(inputPoints):             row = inputPoints[rowNr]             if dominates(candidateRow, row):                 # If it is worse on all features remove the row from the array                 inputPoints.remove(row)                 dominatedPoints.add(tuple(row))             elif dominates(row, candidateRow):                 nonDominated = False                 dominatedPoints.add(tuple(candidateRow))                 rowNr += 1             else:                 rowNr += 1          if nonDominated:             # add the non-dominated point to the Pareto frontier             paretoPoints.add(tuple(candidateRow))          if len(inputPoints) == 0:             break     return paretoPoints, dominatedPoints

Found here: http://code.activestate.com/recipes/578287-multidimensional-pareto-front/

What's the fastest way to find the set of non-dominated solutions in an ensemble of solutions? Or, in short, can Python do better than this algorithm?

926

asked Sep 25 '15 23:09

2 Answers

If you're worried about actual speed, you definitely want to use numpy (as the clever algorithmic tweaks probably have way less effect than the gains to be had from using array operations). Here are three solutions that all compute the same function. The is_pareto_efficient_dumb solution is slower in most situations but becomes faster as the number of costs increases, the is_pareto_efficient_simple solution is much more efficient than the dumb solution for many points, and the final is_pareto_efficient function is less readable but the fastest (so all are Pareto Efficient!).

import numpy as np   # Very slow for many datapoints.  Fastest for many costs, most readable def is_pareto_efficient_dumb(costs):     """     Find the pareto-efficient points     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         is_efficient[i] = np.all(np.any(costs[:i]>c, axis=1)) and np.all(np.any(costs[i+1:]>c, axis=1))     return is_efficient   # Fairly fast for many datapoints, less fast for many costs, somewhat readable def is_pareto_efficient_simple(costs):     """     Find the pareto-efficient points     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         if is_efficient[i]:             is_efficient[is_efficient] = np.any(costs[is_efficient]<c, axis=1)  # Keep any point with a lower cost             is_efficient[i] = True  # And keep self     return is_efficient   # Faster than is_pareto_efficient_simple, but less readable. def is_pareto_efficient(costs, return_mask = True):     """     Find the pareto-efficient points     :param costs: An (n_points, n_costs) array     :param return_mask: True to return a mask     :return: An array of indices of pareto-efficient points.         If return_mask is True, this will be an (n_points, ) boolean array         Otherwise it will be a (n_efficient_points, ) integer array of indices.     """     is_efficient = np.arange(costs.shape[0])     n_points = costs.shape[0]     next_point_index = 0  # Next index in the is_efficient array to search for     while next_point_index<len(costs):         nondominated_point_mask = np.any(costs<costs[next_point_index], axis=1)         nondominated_point_mask[next_point_index] = True         is_efficient = is_efficient[nondominated_point_mask]  # Remove dominated points         costs = costs[nondominated_point_mask]         next_point_index = np.sum(nondominated_point_mask[:next_point_index])+1     if return_mask:         is_efficient_mask = np.zeros(n_points, dtype = bool)         is_efficient_mask[is_efficient] = True         return is_efficient_mask     else:         return is_efficient

Profiling tests (using points drawn from a Normal distribution):

With 10,000 samples, 2 costs:

is_pareto_efficient_dumb: Elapsed time is 1.586s is_pareto_efficient_simple: Elapsed time is 0.009653s is_pareto_efficient: Elapsed time is 0.005479s

With 1,000,000 samples, 2 costs:

is_pareto_efficient_dumb: Really, really, slow is_pareto_efficient_simple: Elapsed time is 1.174s is_pareto_efficient: Elapsed time is 0.4033s

With 10,000 samples, 15 costs:

is_pareto_efficient_dumb: Elapsed time is 4.019s is_pareto_efficient_simple: Elapsed time is 6.466s is_pareto_efficient: Elapsed time is 6.41s

Note that if efficiency is a concern you can gain maybe a further 2x or so speedup by reordering your data beforehand, see here.

130

answered Oct 28 '22 02:10

Peter

Updated Aug 2019

Here is another simple implementation that is pretty fast for modest dimensions. Input points are assumed to be unique.

def keep_efficient(pts):     'returns Pareto efficient row subset of pts'     # sort points by decreasing sum of coordinates     pts = pts[pts.sum(1).argsort()[::-1]]     # initialize a boolean mask for undominated points     # to avoid creating copies each iteration     undominated = np.ones(pts.shape[0], dtype=bool)     for i in range(pts.shape[0]):         # process each point in turn         n = pts.shape[0]         if i >= n:             break         # find all points not dominated by i         # since points are sorted by coordinate sum         # i cannot dominate any points in 1,...,i-1         undominated[i+1:n] = (pts[i+1:] >= pts[i]).any(1)          # keep points undominated so far         pts = pts[undominated[:n]]     return pts

We start by sorting points according to the sum of coordinates. This is useful because

For many distributions of data, a point with a largest coordinate sum will dominate a large number of points.
If point x has a larger coordinate sum than point y, then y cannot dominate x.

Here are some benchmarks relative to Peter's answer, using np.random.randn.

N=10000 d=2  keep_efficient 1.31 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) is_pareto_efficient 6.51 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   N=10000 d=3  keep_efficient 2.3 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 16.4 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   N=10000 d=4  keep_efficient 4.37 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 21.1 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   N=10000 d=5  keep_efficient 15.1 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) is_pareto_efficient 110 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   N=10000 d=6  keep_efficient 40.1 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) is_pareto_efficient 279 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   N=10000 d=15  keep_efficient 3.92 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) is_pareto_efficient 5.88 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Convex Hull Heuristic

I ended up looking at this problem recently and found a useful heuristic that works well if there are many points distributed independently and dimensions are few.

The idea is to compute the convex hull of points. With few dimensions and independently distributed points, the number of vertices of the convex hull will be small. Intuitively, we can expect some vertices of the convex hull to dominate many of the original points. Moreover, if a point in a convex hull is not dominated by any other point in the convex hull, then it is also not dominated by any point in the original set.

This gives a simple iterative algorithm. We repeatedly

Compute the convex hull.
Save Pareto undominated points from the convex hull.
Filter the points to remove those dominated by elements of the convex hull.

I add a few benchmarks for dimension 3. It seems that for some distribution of points this approach yields better asymptotics.

import numpy as np from scipy import spatial from functools import reduce  # test points pts = np.random.rand(10_000_000, 3)   def filter_(pts, pt):     """     Get all points in pts that are not Pareto dominated by the point pt     """     weakly_worse   = (pts <= pt).all(axis=-1)     strictly_worse = (pts < pt).any(axis=-1)     return pts[~(weakly_worse & strictly_worse)]   def get_pareto_undominated_by(pts1, pts2=None):     """     Return all points in pts1 that are not Pareto dominated     by any points in pts2     """     if pts2 is None:         pts2 = pts1     return reduce(filter_, pts2, pts1)   def get_pareto_frontier(pts):     """     Iteratively filter points based on the convex hull heuristic     """     pareto_groups = []      # loop while there are points remaining     while pts.shape[0]:         # brute force if there are few points:         if pts.shape[0] < 10:             pareto_groups.append(get_pareto_undominated_by(pts))             break          # compute vertices of the convex hull         hull_vertices = spatial.ConvexHull(pts).vertices          # get corresponding points         hull_pts = pts[hull_vertices]          # get points in pts that are not convex hull vertices         nonhull_mask = np.ones(pts.shape[0], dtype=bool)         nonhull_mask[hull_vertices] = False         pts = pts[nonhull_mask]          # get points in the convex hull that are on the Pareto frontier         pareto   = get_pareto_undominated_by(hull_pts)         pareto_groups.append(pareto)          # filter remaining points to keep those not dominated by         # Pareto points of the convex hull         pts = get_pareto_undominated_by(pts, pareto)      return np.vstack(pareto_groups)  # -------------------------------------------------------------------------------- # previous solutions # --------------------------------------------------------------------------------  def is_pareto_efficient_dumb(costs):     """     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         is_efficient[i] = np.all(np.any(costs>=c, axis=1))     return is_efficient   def is_pareto_efficient(costs):     """     :param costs: An (n_points, n_costs) array     :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient     """     is_efficient = np.ones(costs.shape[0], dtype = bool)     for i, c in enumerate(costs):         if is_efficient[i]:             is_efficient[is_efficient] = np.any(costs[is_efficient]<=c, axis=1)  # Remove dominated points     return is_efficient   def dominates(row, rowCandidate):     return all(r >= rc for r, rc in zip(row, rowCandidate))   def cull(pts, dominates):     dominated = []     cleared = []     remaining = pts     while remaining:         candidate = remaining[0]         new_remaining = []         for other in remaining[1:]:             [new_remaining, dominated][dominates(candidate, other)].append(other)         if not any(dominates(other, candidate) for other in new_remaining):             cleared.append(candidate)         else:             dominated.append(candidate)         remaining = new_remaining     return cleared, dominated  # -------------------------------------------------------------------------------- # benchmarking # --------------------------------------------------------------------------------  # to accomodate the original non-numpy solution pts_list = [list(pt) for pt in pts]  import timeit  # print('Old non-numpy solution:s\t{}'.format(     # timeit.timeit('cull(pts_list, dominates)', number=3, globals=globals())))  print('Numpy solution:\t{}'.format(     timeit.timeit('is_pareto_efficient(pts)', number=3, globals=globals())))  print('Convex hull heuristic:\t{}'.format(     timeit.timeit('get_pareto_frontier(pts)', number=3, globals=globals())))

Results

# >>= python temp.py # 1,000 points # Old non-numpy solution:      0.0316428339574486 # Numpy solution:              0.005961259012110531 # Convex hull heuristic:       0.012369581032544374 # >>= python temp.py # 1,000,000 points # Old non-numpy solution:      70.67529802105855 # Numpy solution:              5.398462114972062 # Convex hull heuristic:       1.5286884519737214 # >>= python temp.py # 10,000,000 points # Numpy solution:              98.03680767398328 # Convex hull heuristic:       10.203076395904645

Original Post

I took a shot at rewriting the same algorithm with a couple of tweaks. I think most of your problems come from inputPoints.remove(row). This requires searching through the list of points; removing by index would be much more efficient. I also modified the dominates function to avoid some redundant comparisons. This could be handy in a higher dimension.

def dominates(row, rowCandidate):     return all(r >= rc for r, rc in zip(row, rowCandidate))  def cull(pts, dominates):     dominated = []     cleared = []     remaining = pts     while remaining:         candidate = remaining[0]         new_remaining = []         for other in remaining[1:]:             [new_remaining, dominated][dominates(candidate, other)].append(other)         if not any(dominates(other, candidate) for other in new_remaining):             cleared.append(candidate)         else:             dominated.append(candidate)         remaining = new_remaining     return cleared, dominated

answered Oct 28 '22 02:10

hilberts_drinking_problem

Related questions
                            
                                Spyder 5 missing dependencies - spyder_kernels version error [closed]
                            
                                What does the ** maths operator do in Python?
                            
                                What is the best way to do automatic attribute assignment in Python, and is it a good idea?
                            
                                Automatically import models on Django shell launch
                            
                                Heroku & Django: "OSError: No such file or directory: '/app/{myappname}/static'"
                            
                                How can I pass parameters to a RequestHandler?
                            
                                How to activate different anaconda environment from powershell
                            
                                How do I set the content-type for POST requests in python-requests library?
                            
                                No module named 'tqdm'
                            
                                Using monotonically_increasing_id() for assigning row number to pyspark dataframe
                            
                                Read a file on App Engine with Python?
                            
                                Use fnmatch.filter to filter files by more than one possible file extension
                            
                                Python: Iterating through a dictionary gives me "int object not iterable"
                            
                                Can Pylint error checking be customized?
                            
                                Beautiful Soup find children for particular div
                            
                                How can I check the existence of attributes and tags in XML before parsing?
                            
                                Unpivot Pandas Data
                            
                                Using openpyxl to read file from memory
                            
                                How to remove parentheses and all data within using Pandas/Python?
                            
                                Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With