Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NumPy: np.lexsort with fuzzy/tolerant comparisons

I have a collection of N points in three dimensions. These are stored as an np.array with a shape of (N,3). All of the points are distinct with the minimum distance between any two points being ~1e-5. I am looking for a means of obtaining an order in which to iterate over these points which is both independent of their current order in the np.array and robust to small perturbations of individual components.

The simplest means of satisfying the first requirement is with np.lexsort with

np.lexsort(my_array.T)

however this fails in the robustness department:

In [6]: my_array = np.array([[-0.5, 0, 2**0.5], [0.5, 0, 2**0.5 - 1e-15]])

In [7]: my_array[np.lexsort(my_array.T)]
Out[7]: 
array([[ 0.5       ,  0.        ,  1.41421356],
       [-0.5       ,  0.        ,  1.41421356]])

where we can see that in this instance the ordering is extremely sensitive to perturbations. I am therefore looking for a fuzzy variant of np.lexsort which will move onto the next axis if two values in one axis are within a tolerance of epsilon. (Or any alternative mechanism which will permit me to obtain an ordering.)

As my application has several million of these collections, all of which need ordering, performance is something of a concern (which is why I have not blindly tried to roll my own tolerant np.lexsort without first seeing if there is a better way to do it).

like image 907
Freddie Witherden Avatar asked Sep 28 '13 22:09

Freddie Witherden


1 Answers

My eventual solution was:

def fuzzysort(arr, idx, dim=0, tol=1e-6):
    # Extract our dimension and argsort
    arrd = arr[dim]
    srtdidx = sorted(idx, key=arrd.__getitem__)

    i, ix = 0, srtdidx[0]
    for j, jx in enumerate(srtdidx[1:], start=1):
        if arrd[jx] - arrd[ix] >= tol:
            if j - i > 1:
                srtdidx[i:j] = fuzzysort(arr, srtdidx[i:j], dim + 1, tol)
            i, ix = j, jx

    if i != j:
        srtdidx[i:] = fuzzysort(arr, srtdidx[i:], dim + 1, tol)

    return srtdidx

I note that this is slightly over-engineered for the problem described above. As with np.lexsort the array must be passed in transposed form. The idx parameter permits one to control what indices are considered (allowing for elements to be crudely masked). Otherwise list(xrange(0, N)) will do.

Performance isn't great. However, this is mostly a consequence of NumPy scalar types behaving badly. Calling tolist() on the array beforehand improves the situation somewhat.

like image 50
Freddie Witherden Avatar answered Nov 05 '22 22:11

Freddie Witherden