I have an array of 3D points:
a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])
How can I compute the geometric median of those points?
Given N points(in 2D) with x and y coordinates, find a point P (in N given points) such that the sum of distances from other(N-1) points to P is minimum. This point is commonly known as Geometric Median.
The geometric median is unique whenever the points are not collinear. The geometric median is equivariant for Euclidean similarity transformations, including translation and rotation.
Note: the geometric mean will not always equal the median, only in cases where there is an exact consistent multiplicative relationship between all numbers (e.g. multiplying each previous number by 3, as we did).
The spatial median, called the mediancentre in an earlier paper by J. C. Gower, is. defined as the bivariate location measure to minimize the sum of absolute distances to. observations.
I implemented Yehuda Vardi and Cun-Hui Zhang's algorithm for the geometric median, described in their paper "The multivariate L1-median and associated data depth". Everything is vectorized in numpy, so should be very fast. I didn't implement weights - only unweighted points.
import numpy as np
from scipy.spatial.distance import cdist, euclidean
def geometric_median(X, eps=1e-5):
y = np.mean(X, 0)
while True:
D = cdist(X, [y])
nonzeros = (D != 0)[:, 0]
Dinv = 1 / D[nonzeros]
Dinvs = np.sum(Dinv)
W = Dinv / Dinvs
T = np.sum(W * X[nonzeros], 0)
num_zeros = len(X) - np.sum(nonzeros)
if num_zeros == 0:
y1 = T
elif num_zeros == len(X):
return y
else:
R = (T - y) * Dinvs
r = np.linalg.norm(R)
rinv = 0 if r == 0 else num_zeros/r
y1 = max(0, 1-rinv)*T + min(1, rinv)*y
if euclidean(y, y1) < eps:
return y1
y = y1
In addition to the default SO license terms, I release the code above under the zlib license, if you so prefer.
The calculation of the geometric median with the Weiszfeld's iterative algorithm is implemented in Python in this gist or in the function below copied from the OpenAlea software (CeCILL-C license),
import numpy as np
import math
import warnings
def geometric_median(X, numIter = 200):
"""
Compute the geometric median of a point sample.
The geometric median coordinates will be expressed in the Spatial Image reference system (not in real world metrics).
We use the Weiszfeld's algorithm (http://en.wikipedia.org/wiki/Geometric_median)
:Parameters:
- `X` (list|np.array) - voxels coordinate (3xN matrix)
- `numIter` (int) - limit the length of the search for global optimum
:Return:
- np.array((x,y,z)): geometric median of the coordinates;
"""
# -- Initialising 'median' to the centroid
y = np.mean(X,1)
# -- If the init point is in the set of points, we shift it:
while (y[0] in X[0]) and (y[1] in X[1]) and (y[2] in X[2]):
y+=0.1
convergence=False # boolean testing the convergence toward a global optimum
dist=[] # list recording the distance evolution
# -- Minimizing the sum of the squares of the distances between each points in 'X' and the median.
i=0
while ( (not convergence) and (i < numIter) ):
num_x, num_y, num_z = 0.0, 0.0, 0.0
denum = 0.0
m = X.shape[1]
d = 0
for j in range(0,m):
div = math.sqrt( (X[0,j]-y[0])**2 + (X[1,j]-y[1])**2 + (X[2,j]-y[2])**2 )
num_x += X[0,j] / div
num_y += X[1,j] / div
num_z += X[2,j] / div
denum += 1./div
d += div**2 # distance (to the median) to miminize
dist.append(d) # update of the distance evolution
if denum == 0.:
warnings.warn( "Couldn't compute a geometric median, please check your data!" )
return [0,0,0]
y = [num_x/denum, num_y/denum, num_z/denum] # update to the new value of the median
if i > 3:
convergence=(abs(dist[i]-dist[i-2])<0.1) # we test the convergence over three steps for stability
#~ print abs(dist[i]-dist[i-2]), convergence
i += 1
if i == numIter:
raise ValueError( "The Weiszfeld's algoritm did not converged after"+str(numIter)+"iterations !!!!!!!!!" )
# -- When convergence or iterations limit is reached we assume that we found the median.
return np.array(y)
Alternatively, you could use the C implementation, mentionned in this answer, and interface it to python with, for instance, ctypes
.
The problem can be easily approximated with minimize
module in scipy
. In this module, it provides various optimization algorithms, from nelder-mead to newton-CG. Nelder-mead algorithm is particular useful if you do not want to bother with high order derivatives, at a cost of losing some precision. Nevertheless, you just need to know the function to be minimized for nelder-mead algorithm to work.
Now, referring to the same array in the questions, if we use @orlp's method, we will get this:
geometric_median(a)
# array([12.58942481, 3.51573852, 7.28710661])
For Nelder-mead method, you will see below. The function to be minimized is the distance function from all points i.e.
Here is the code:
from scipy.optimize import minimize
x = [point[0] for point in a]
y = [point[1] for point in a]
z = [point[2] for point in a]
x0 = np.array([sum(x)/len(x),sum(y)/len(y), sum(z)/len(z)])
def dist_func(x0):
return sum(((np.full(len(x),x0[0])-x)**2+(np.full(len(x),x0[1])-y)**2+(np.full(len(x),x0[2])-z)**2)**(1/2))
res = minimize(dist_func, x0, method='nelder-mead', options={'xtol': 1e-8, 'disp': True})
res.x
# array([12.58942487, 3.51573846, 7.28710679])
Note that I use to mean of all points as the initial values for the alogrithm. The result is quite close to @orlp's method, which is more accurate. As I mentioned, you sacrifice a bit but still get quite good approximates.
Performance of Nelder Mead Algorithm
For this, I generated a test_array
with 10000 entries of points from normal distribution centred at 3.2. Therefore, the geometric median should be quite close to [3.2, 3.2, 3.2].
np.random.seed(3)
test_array = np.array([[np.random.normal(3.2,20),
np.random.normal(3.2,20),
np.random.normal(3.2,20)] for i in np.arange(10000)])
For @orlp's method,
%timeit geometric_median(test_array)
# 12.1 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# array([2.95151061, 3.14098477, 3.01468281])
For Nelder mead,
%timeit res.x
# 565 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# array([2.95150898, 3.14098468, 3.01468276])
@orlp's method is very fast while Nelder mead is not bad. However, Nelder mead method is generic whereas @orlp's is specific to geometric median. The method you would like to choose would depend on your purpose. If you just want an approximate, I will choose Nelder at ease. If you want to be exact, @orlp's method is both faster and more accurate.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With