Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Geometric median of multidimensional points

I have an array of 3D points:

a = np.array([[2., 3., 8.], [10., 4., 3.], [58., 3., 4.], [34., 2., 43.]])

How can I compute the geometric median of those points?

like image 370
orlp Avatar asked May 18 '15 09:05

orlp


People also ask

How to find geometric median of n points?

Given N points(in 2D) with x and y coordinates, find a point P (in N given points) such that the sum of distances from other(N-1) points to P is minimum. This point is commonly known as Geometric Median.

Is geometric median unique?

The geometric median is unique whenever the points are not collinear. The geometric median is equivariant for Euclidean similarity transformations, including translation and rotation.

Is geometric mean the same as median?

Note: the geometric mean will not always equal the median, only in cases where there is an exact consistent multiplicative relationship between all numbers (e.g. multiplying each previous number by 3, as we did).

What is spatial median?

The spatial median, called the mediancentre in an earlier paper by J. C. Gower, is. defined as the bivariate location measure to minimize the sum of absolute distances to. observations.


3 Answers

I implemented Yehuda Vardi and Cun-Hui Zhang's algorithm for the geometric median, described in their paper "The multivariate L1-median and associated data depth". Everything is vectorized in numpy, so should be very fast. I didn't implement weights - only unweighted points.

import numpy as np
from scipy.spatial.distance import cdist, euclidean

def geometric_median(X, eps=1e-5):
    y = np.mean(X, 0)

    while True:
        D = cdist(X, [y])
        nonzeros = (D != 0)[:, 0]

        Dinv = 1 / D[nonzeros]
        Dinvs = np.sum(Dinv)
        W = Dinv / Dinvs
        T = np.sum(W * X[nonzeros], 0)

        num_zeros = len(X) - np.sum(nonzeros)
        if num_zeros == 0:
            y1 = T
        elif num_zeros == len(X):
            return y
        else:
            R = (T - y) * Dinvs
            r = np.linalg.norm(R)
            rinv = 0 if r == 0 else num_zeros/r
            y1 = max(0, 1-rinv)*T + min(1, rinv)*y

        if euclidean(y, y1) < eps:
            return y1

        y = y1

In addition to the default SO license terms, I release the code above under the zlib license, if you so prefer.

like image 86
orlp Avatar answered Oct 07 '22 17:10

orlp


The calculation of the geometric median with the Weiszfeld's iterative algorithm is implemented in Python in this gist or in the function below copied from the OpenAlea software (CeCILL-C license),

import numpy as np
import math
import warnings

def geometric_median(X, numIter = 200):
    """
    Compute the geometric median of a point sample.
    The geometric median coordinates will be expressed in the Spatial Image reference system (not in real world metrics).
    We use the Weiszfeld's algorithm (http://en.wikipedia.org/wiki/Geometric_median)

    :Parameters:
     - `X` (list|np.array) - voxels coordinate (3xN matrix)
     - `numIter` (int) - limit the length of the search for global optimum

    :Return:
     - np.array((x,y,z)): geometric median of the coordinates;
    """
    # -- Initialising 'median' to the centroid
    y = np.mean(X,1)
    # -- If the init point is in the set of points, we shift it:
    while (y[0] in X[0]) and (y[1] in X[1]) and (y[2] in X[2]):
        y+=0.1

    convergence=False # boolean testing the convergence toward a global optimum
    dist=[] # list recording the distance evolution

    # -- Minimizing the sum of the squares of the distances between each points in 'X' and the median.
    i=0
    while ( (not convergence) and (i < numIter) ):
        num_x, num_y, num_z = 0.0, 0.0, 0.0
        denum = 0.0
        m = X.shape[1]
        d = 0
        for j in range(0,m):
            div = math.sqrt( (X[0,j]-y[0])**2 + (X[1,j]-y[1])**2 + (X[2,j]-y[2])**2 )
            num_x += X[0,j] / div
            num_y += X[1,j] / div
            num_z += X[2,j] / div
            denum += 1./div
            d += div**2 # distance (to the median) to miminize
        dist.append(d) # update of the distance evolution

        if denum == 0.:
            warnings.warn( "Couldn't compute a geometric median, please check your data!" )
            return [0,0,0]

        y = [num_x/denum, num_y/denum, num_z/denum] # update to the new value of the median
        if i > 3:
            convergence=(abs(dist[i]-dist[i-2])<0.1) # we test the convergence over three steps for stability
            #~ print abs(dist[i]-dist[i-2]), convergence
        i += 1
    if i == numIter:
        raise ValueError( "The Weiszfeld's algoritm did not converged after"+str(numIter)+"iterations !!!!!!!!!" )
    # -- When convergence or iterations limit is reached we assume that we found the median.

    return np.array(y)

Alternatively, you could use the C implementation, mentionned in this answer, and interface it to python with, for instance, ctypes.

like image 45
rth Avatar answered Oct 07 '22 17:10

rth


The problem can be easily approximated with minimize module in scipy. In this module, it provides various optimization algorithms, from nelder-mead to newton-CG. Nelder-mead algorithm is particular useful if you do not want to bother with high order derivatives, at a cost of losing some precision. Nevertheless, you just need to know the function to be minimized for nelder-mead algorithm to work.

Now, referring to the same array in the questions, if we use @orlp's method, we will get this:

geometric_median(a)
# array([12.58942481,  3.51573852,  7.28710661])

For Nelder-mead method, you will see below. The function to be minimized is the distance function from all points i.e.

a formula

Here is the code:

from scipy.optimize import minimize
x = [point[0] for point in a]
y = [point[1] for point in a]
z = [point[2] for point in a]

x0 = np.array([sum(x)/len(x),sum(y)/len(y), sum(z)/len(z)])
def dist_func(x0):
    return sum(((np.full(len(x),x0[0])-x)**2+(np.full(len(x),x0[1])-y)**2+(np.full(len(x),x0[2])-z)**2)**(1/2))
res = minimize(dist_func, x0, method='nelder-mead', options={'xtol': 1e-8, 'disp': True})
res.x
# array([12.58942487,  3.51573846,  7.28710679])

Note that I use to mean of all points as the initial values for the alogrithm. The result is quite close to @orlp's method, which is more accurate. As I mentioned, you sacrifice a bit but still get quite good approximates.

Performance of Nelder Mead Algorithm For this, I generated a test_array with 10000 entries of points from normal distribution centred at 3.2. Therefore, the geometric median should be quite close to [3.2, 3.2, 3.2].

np.random.seed(3)
test_array = np.array([[np.random.normal(3.2,20),
                        np.random.normal(3.2,20),
                        np.random.normal(3.2,20)] for i in np.arange(10000)])

For @orlp's method,

%timeit geometric_median(test_array)
# 12.1 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# array([2.95151061, 3.14098477, 3.01468281])

For Nelder mead,

%timeit res.x
# 565 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# array([2.95150898, 3.14098468, 3.01468276])

@orlp's method is very fast while Nelder mead is not bad. However, Nelder mead method is generic whereas @orlp's is specific to geometric median. The method you would like to choose would depend on your purpose. If you just want an approximate, I will choose Nelder at ease. If you want to be exact, @orlp's method is both faster and more accurate.

like image 3
shiba Avatar answered Oct 07 '22 19:10

shiba