Find integer nearest-neighbour in a dict

I have a dict that takes integer keys:

a = {}
a[1] = 100
a[55] = 101
a[127] = 102

I would like to be able to get the nearest neighbour's value when asking:

a[20] # should return a[1] = 100
a[58] # should return a[55] = 101
a[167] # should return a[127] = 102

Is there a pythonic way of doing this? (I imagine it can be done by looping over all the keys, but that's probably not the most elegant solution.)
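
For concreteness, the brute-force loop I have in mind would be something like this (a minimal sketch that scans every key on each lookup):

def nearest(d, query):
    # Brute force: pick the key with the smallest absolute gap to the query
    return d[min(d, key=lambda k: abs(k - query))]

nearest(a, 20)   # 100
nearest(a, 58)   # 101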


Same question with a double index (also integers):

 b[90, 1] = 100;  b[90, 55] = 101;  b[90, 127] = 102
 b[70, 1] = 40;   b[70, 45] = 41;   b[70, 107] = 42

I would like b[73, 40] to return b[70, 45] = 41, i.e. the nearest neighbour in the 2D plane.

asked Jan 09 '23 by Basj

1 Answer

Update: After benchmarking the two approaches in this answer, the second approach is significantly better, to the point that it should almost always be preferred.


The following approach handles any number of dimensions identically:

class NearestDict(dict):
    def __init__(self, ndims):
        super().__init__()
        self.ndims = ndims

    # Enforce dimensionality: every key is stored as an ndims-tuple
    def __setitem__(self, key, val):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        super().__setitem__(key, val)

    @staticmethod
    def __dist(ka, kb):
        # Squared Euclidean distance -- the square root is unnecessary
        # because we only ever compare distances, never report them
        assert len(ka) == len(kb)
        return sum((ea - eb) ** 2 for (ea, eb) in zip(ka, kb))

    # Helper method, possibly useful on its own
    def nearest_key(self, key):
        if not isinstance(key, tuple): key = (key,)
        return min(self, key=lambda k: NearestDict.__dist(key, k))

    # Called only on a miss: fall back to the value at the nearest key
    def __missing__(self, key):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        return self[self.nearest_key(key)]

Demo:

a = NearestDict(1)
a[1] = 100
a[55] = 101
a[127] = 102
print(a[20])    # 100
print(a[58])    # 101
print(a[167])   # 102
print(a.nearest_key(20))     # (1,)
print(a.nearest_key(58))     # (55,)
print(a.nearest_key(127))    # (127,)

b = NearestDict(2)
b[90, 1]   = 100
b[90, 55]  = 101
b[90, 127] = 102
b[70, 1]   = 40
b[70, 45]  = 41
b[70, 107] = 42
print(b[73, 40])               # 41
print(b.nearest_key((73, 40))) # (70, 45)

Note that if the key exists, the lookup is no slower than a standard dictionary lookup. If the key does not exist, you compute the distance to every existing key. Nothing is cached, although you could tack that on.
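
For example, a minimal sketch of that caching idea (hypothetical, not benchmarked here): remember the resolved nearest key for each miss, and invalidate the cache on every insert, since a new key may become the nearest neighbour.

class CachedNearestDict(NearestDict):
    def __init__(self, ndims):
        super().__init__(ndims)
        self.__cache = {}              # missing key -> nearest existing key

    def __setitem__(self, key, val):
        self.__cache.clear()           # any insert can change nearest neighbours
        super().__setitem__(key, val)

    def __missing__(self, key):
        # (dimension check omitted for brevity)
        if not isinstance(key, tuple): key = (key,)
        if key not in self.__cache:
            self.__cache[key] = self.nearest_key(key)
        return self[self.__cache[key]]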

Edit:

Building on the approach suggested in Kasra's answer, the following implements the same class as above using scipy's cKDTree:

Note the additional optional argument, regenOnAdd: setting it True rebuilds the KDTree on every insert, while leaving it False defers (re)building the tree until the first lookup after you have completed (the majority of) your inserts:

from scipy.spatial import cKDTree

class KDDict(dict):
    def __init__(self, ndims, regenOnAdd=False):
        super().__init__()
        self.ndims = ndims
        self.regenOnAdd = regenOnAdd
        self.__keys = []        # insertion-ordered key tuples, mirrors the tree
        self.__tree = None
        self.__stale = False    # True when the tree lags behind __keys

    # Enforce dimensionality, and mark the tree stale on every insert
    def __setitem__(self, key, val):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        if key not in self:     # avoid duplicate tree points when overwriting
            self.__keys.append(key)
        self.__stale = True
        if self.regenOnAdd: self.regenTree()
        super().__setitem__(key, val)

    def regenTree(self):
        self.__tree = cKDTree(self.__keys)
        self.__stale = False

    # Helper method, possibly useful on its own
    def nearest_key(self, key):
        if not isinstance(key, tuple): key = (key,)
        if self.__stale: self.regenTree()
        _, idx = self.__tree.query(key, 1)
        return self.__keys[idx]

    # Called only on a miss: fall back to the value at the nearest key
    def __missing__(self, key):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        return self[self.nearest_key(key)]

The output is the same as for the first approach.
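
For instance, a short usage sketch of the deferred mode (regenOnAdd=False), reusing the data from the demo above: all inserts happen first, and the tree is built exactly once, on the first lookup.

b = KDDict(2, regenOnAdd=False)
b[90, 1]   = 100
b[90, 55]  = 101
b[90, 127] = 102
b[70, 1]   = 40
b[70, 45]  = 41
b[70, 107] = 42
print(b[73, 40])               # 41 -- this first miss triggers one regenTree()
print(b.nearest_key((73, 40))) # (70, 45)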

Benchmark Results

To understand the performance of the three approaches (NearestDict, KDDict(True) (regen on insert), and KDDict(False) (defer regen)), I briefly benchmarked them.

I ran 3 different tests; a sketch of the timing harness follows the parameter list. The parameters that stayed the same across the tests were:

  • Number of Test Iterations: I did each test 5 times and took the minimum time. (Note timeit.repeat defaults to 3).
  • Point boundaries: I generated integer key points in the range 0 <= x < 1000
  • Number of lookups: I timed inserts and lookups separately. The three tests below all use 10,000 lookups.
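
A harness along the following lines (a simplified sketch, not the exact script behind the numbers below; SCORE_MUL from the parameter dumps is unused here) shows how the timings were structured:

import random
import timeit

NDIMS, NELEMS, NFINDS = 4, 1000, 10000
DIM_LB, DIM_UB, NITER = 0, 1000, 5

def rand_key():
    return tuple(random.randrange(DIM_LB, DIM_UB) for _ in range(NDIMS))

def bench(factory):
    d = factory(NDIMS)
    keys = [rand_key() for _ in range(NELEMS)]
    # Time NELEMS inserts and NFINDS lookups separately,
    # taking the minimum over NITER repetitions
    ins = min(timeit.repeat(lambda: [d.__setitem__(k, 0) for k in keys],
                            number=1, repeat=NITER))
    srch = min(timeit.repeat(lambda: [d[rand_key()] for _ in range(NFINDS)],
                             number=1, repeat=NITER))
    return ins, srch

for name, factory in (("NearestDict", NearestDict),
                      ("KDDict(regen)", lambda n: KDDict(n, regenOnAdd=True)),
                      ("KDDict(defer)", lambda n: KDDict(n, regenOnAdd=False))):
    print(name, *bench(factory))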

The first test used keys of 4 dimensions and 1,000 insertions.

{'NDIMS': 4, 'NITER': 5, 'NELEMS': 1000, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.125
insert::KDDict(regen)    35.957
insert::KDDict(defer)     0.174
search::NearestDict    2636.965
search::KDDict(regen)    49.965
search::KDDict(defer)    51.880

The second test used keys of 4 dimensions and 100 insertions. I wanted to vary the number of insertions to see how well the two approaches performed as the density of the dictionary varied.

{'NDIMS': 4, 'NITER': 5, 'NELEMS': 100, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.013
insert::KDDict(regen)     0.629
insert::KDDict(defer)     0.018
search::NearestDict     247.920
search::KDDict(regen)    44.523
search::KDDict(defer)    44.718

The third test used 100 insertions (like the second test) but 12 dimensions. I wanted to see how the approaches performed as key dimensionality increased.

{'NDIMS': 12, 'NITER': 5, 'NELEMS': 100, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.013
insert::KDDict(regen)     0.722
insert::KDDict(defer)     0.017
search::NearestDict     405.092
search::KDDict(regen)    49.046
search::KDDict(defer)    50.601

Discussion

KDDict with continuous regeneration (KDDict(True)) is either fractionally faster (in lookup) or considerably slower (in insertion). Because of this, I'm leaving it out of the discussion and focusing on NearestDict and KDDict(False), referred to below simply as KDDict.

The results were surprisingly in favor of KDDict with deferred regeneration.

For insertion, in all cases, KDDict performed slightly worse than NearestDict. This was to be expected because of the additional list append operation.

For search, in all cases, KDDict performed significantly better than NearestDict.

As the sparsity of the dictionary decreased (i.e. density increased), NearestDict's performance degraded to a much greater extent than KDDict's. When going from 100 keys to 1,000 keys, NearestDict's search time increased by 9.64x (247.920 to 2636.965), while KDDict's search time increased by only 0.16x (44.718 to 51.880).

As the dimensionality of the keys increased, NearestDict's performance degraded to a greater extent than KDDict's. When going from 4 to 12 dimensions, NearestDict's search time increased by 0.64x (247.920 to 405.092), while KDDict's search time increased by only 0.13x (44.718 to 50.601).

In light of this, and given the roughly equal complexity of the two classes, the KDDict approach is strongly recommended if you have access to scipy.

answered Jan 13 '23 by jedwards