Build an approximately uniform grid from random sample (python)

Tags:

I want to build a grid from sampled data. I could use a machine learning - clustering algorithm, like k-means, but I want to restrict the centres to be roughly uniformly distributed.

I have come up with an approach using the scikit-learn nearest neighbours search: pick a point at random, delete all points within radius r then repeat. This works well, but wondering if anyone has a better (faster) way of doing this.

In response to comments I have tried two alternate methods, one turns out much slower the other is about the same...

Method 0 (my first attempt):

def get_centers0(X, r): 

    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0,D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')

    while N > 0:
        nearest.fit(X)
        x = X[int(random()*N), :]
        _, del_x = nearest.radius_neighbors(x)
        X = np.delete(X, del_x[0], axis = 0)
        grid = np.vstack([grid, x])
        N = X.shape[0]

    return grid

Method 1 (using the precomputed graph):

def get_centers1(X, r): 

    N = X.shape[0]
    D = X.shape[1]
    grid = np.zeros([0,D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
    nearest.fit(X)
    graph = nearest.radius_neighbors_graph(X)

    #This method is very slow even before doing any 'pruning'

Method 2:

def get_centers2(X, r, k): 

    N = X.shape[0]
    D = X.shape[1]
    k = k
    grid = np.zeros([0,D])
    nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')

    while N > 0:
        nearest.fit(X)
        x = X[np.random.randint(0,N,k), :]

        #min_dist = near.NearestNeighbors().fit(x).kneighbors(x, n_neighbors = 1, return_distance = True)
        min_dist = dist(x, k, 2, np.ones(k)) # where dist is a cython compiled function
        x = x[min_dist < 0.1,:]

        _, del_x = nearest.radius_neighbors(x)
        X = np.delete(X, del_x[0], axis = 0)
        grid = np.vstack([grid, x])
        N = X.shape[0]

    return grid

Running these as follows:

N = 50000
r = 0.1
x1 = np.random.rand(N)
x2 = np.random.rand(N)
X = np.vstack([x1, x2]).T

tic = time.time()
grid0 = get_centers0(X, r)
toc = time.time()
print 'Method 0: ' + str(toc - tic)

tic = time.time()
get_centers1(X, r)
toc = time.time()
print 'Method 1: ' + str(toc - tic)

tic = time.time()
grid2 = get_centers2(X, r)
toc = time.time()
print 'Method 1: ' + str(toc - tic)

Method 0 and 2 are about the same...

Method 0: 0.840130090714
Method 1: 2.23365592957
Method 2: 0.774812936783

240

asked Oct 30 '13 09:10

Neal Hughes

3 Answers

I'm not sure from the question exactly what you are trying to do. You mention wanting to create an "approximate grid", or a "uniform distribution", while the code you provide selects a subset of points such that no pairwise distance is greater than r.

A couple possible suggestions:

if what you want is an approximate grid, I would construct the grid you want to approximate, and then query for the nearest neighbor of each grid point. Depending on your application, you might further trim these results to cut-out points whose distance from the grid point is larger than is useful for you.
if what you want is an approximately uniform distribution drawn from among the points, I would do a kernel density estimate (sklearn.neighbors.KernelDensity) at each point, and do a randomized sub-selection from the dataset weighted by the inverse of the local density at each point.
if what you want is a subset of points such that no pairwise distance is greater than r, I would start by constructing a radius_neighbors_graph with radius r, which will, in one go, give you a list of all points which are too close together. You can then use a pruning algorithm similar to the one you wrote above to remove points based on these sparse graph distances.

I hope that helps!

166

answered Sep 23 '22 14:09

jakevdp

I have come up with a very simple method which is much more efficient than my previous attempts.

This one simply loops over the data set and adds the current point to the list of grid points only if it is greater than r distance from all existing centers. This method is around 20 times faster than my previous attempts. Because there are no external libraries involved I can run this all in cython...

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def get_centers_fast(np.ndarray[DTYPE_t, ndim = 2] x, double radius):

    cdef int N = x.shape[0]
    cdef int D = x.shape[1]
    cdef int m = 1
    cdef np.ndarray[DTYPE_t, ndim = 2] xc = np.zeros([10000, D])
    cdef double r = 0
    cdef double r_min = 10
    cdef int i, j, k

    for k in range(D):
        xc[0,k] = x[0,k]

    for i in range(1, N):
        r_min = 10
        for j in range(m):
            r = 0
            for k in range(D):
                r += (x[i, k] - xc[j, k])**2
            r = r**0.5
            if r < r_min:
                r_min = r
        if r_min > radius:
            m = m + 1
            for k in range(D):
                xc[m - 1,k] = x[i,k]

    nonzero = np.nonzero(xc[:,0])[0]
    xc = xc[nonzero,:]

    return xc

Running these methods as follows:

N = 40000
r = 0.1
x1 = np.random.normal(size = N)
x1 = (x1 - min(x1)) / (max(x1)-min(x1))
x2 = np.random.normal(size = N)
x2 = (x2 - min(x2)) / (max(x2)-min(x2))
X = np.vstack([x1, x2]).T

tic = time.time()
grid0 = gt.get_centers0(X, r)
toc = time.time()
print 'Method 0: ' + str(toc - tic)

tic = time.time()
grid2 = gt.get_centers2(X, r, 10)
toc = time.time()
print 'Method 2: ' + str(toc - tic)

tic = time.time()
grid3 = gt.get_centers_fast(X, r)
toc = time.time()
print 'Method 3: ' + str(toc - tic)

The new method is around 20 times faster. It could be made even faster, if I stopped looping early (e.g. if k successive iterations fail to produce a new center).

Method 0: 0.219595909119
Method 2: 0.191949129105
Method 3: 0.0127329826355

answered Sep 22 '22 14:09

Neal Hughes

Maybe you could only re-fit the nearest object every k << N deletions to speedup the process. Most of the time the neighborhood structure should not change much.

answered Sep 21 '22 14:09

ogrisel

Related questions
                            
                                How to show result of python script in a console when running from ST editor?
                            
                                initializing members of a list
                            
                                Python: Making a read-only property accessible via **vars(some_class)
                            
                                Admin FileField current url incorrect
                            
                                Storing and printing an exception with traceback?
                            
                                detect/remove unpaired surrogate character in Python 2 + GTK
                            
                                "Standard" way of creating config file suitable for Python and Java together
                            
                                Concatenate strings if they have an overlapping region
                            
                                Displaying a DecimalField with currency in the django admin
                            
                                numpy array of strings indexing behavior
                            
                                using an alternate /tmp location with pyinstaller
                            
                                Plot cross section through heat map
                            
                                Python Tkinter Spinbox with float
                            
                                How to pass arguments from tornado to a js file but not html?
                            
                                How to save pygame Surface as an image to memory (and not to disk)
                            
                                PyInstaller very big file size
                            
                                How to plot a probability mass function in python
                            
                                Create a multiframe .tif file
                            
                                Pandas GroupBy by Element and everything else
                            
                                matplotlib: Plot numpy arrays with None as values

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Build an approximately uniform grid from random sample (python)

Tags:

python

machine-learning

cluster-analysis

scikit-learn

Neal Hughes

People also ask

3 Answers

jakevdp

Neal Hughes

ogrisel

Recent Activity

Donate For Us