I have a collection of N
points in three dimensions. These are stored as an np.array
with a shape of (N,3)
. All of the points are distinct with the minimum distance between any two points being ~1e-5
. I am looking for a means of obtaining an order in which to iterate over these points which is both independent of their current order in the np.array
and robust to small perturbations of individual components.
The simplest means of satisfying the first requirement is with np.lexsort
with
np.lexsort(my_array.T)
however this fails in the robustness department:
In [6]: my_array = np.array([[-0.5, 0, 2**0.5], [0.5, 0, 2**0.5 - 1e-15]])
In [7]: my_array[np.lexsort(my_array.T)]
Out[7]:
array([[ 0.5 , 0. , 1.41421356],
[-0.5 , 0. , 1.41421356]])
where we can see that in this instance the ordering is extremely sensitive to perturbations. I am therefore looking for a fuzzy variant of np.lexsort
which will move onto the next axis if two values in one axis are within a tolerance of epsilon
. (Or any alternative mechanism which will permit me to obtain an ordering.)
As my application has several million of these collections, all of which need ordering, performance is something of a concern (which is why I have not blindly tried to roll my own tolerant np.lexsort without first seeing if there is a better way to do it).
My eventual solution was:
def fuzzysort(arr, idx, dim=0, tol=1e-6):
# Extract our dimension and argsort
arrd = arr[dim]
srtdidx = sorted(idx, key=arrd.__getitem__)
i, ix = 0, srtdidx[0]
for j, jx in enumerate(srtdidx[1:], start=1):
if arrd[jx] - arrd[ix] >= tol:
if j - i > 1:
srtdidx[i:j] = fuzzysort(arr, srtdidx[i:j], dim + 1, tol)
i, ix = j, jx
if i != j:
srtdidx[i:] = fuzzysort(arr, srtdidx[i:], dim + 1, tol)
return srtdidx
I note that this is slightly over-engineered for the problem described above. As with np.lexsort
the array must be passed in transposed form. The idx
parameter permits one to control what indices are considered (allowing for elements to be crudely masked). Otherwise list(xrange(0, N))
will do.
Performance isn't great. However, this is mostly a consequence of NumPy scalar types behaving badly. Calling tolist()
on the array beforehand improves the situation somewhat.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With