Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reversible dictionary for python

I'd like to store some data in Python in a similar form to a dictionary: {1:'a', 2:'b'}. Every value will be unique, not just among other values, but among keys too.

Is there a simple data structure that I can use to get the corresponding object no matter if I ask using the 'key' or the 'value'? For example:

>>> a = {1:'a', 2:'b'}
>>> a[1]
'a'
>>> a['b']
2
>>> a[3]
KeyError

The 'keys' are standard python ints, an the values are short (<256char) strings.

My current solution is creating a reversed dictionary and searching it if I can't find a result in the original dictionary:

pointsreversed = dict((v, k) for k, v in points.iteritems())
def lookup(key):
    return points.get(key) or pointsreversed.key()

This uses twice as much space, which isn't great (my dictionaries can be up to a few hundred megs) and is 50% slower on average.

EDIT: as mentioned in a few answers, two dicts doesn't double memory usage, as it's only the dictionary, not the items within, that is duplication.

Is there a solution that improves on this?

like image 315
Alex J Avatar asked Jun 30 '09 12:06

Alex J


2 Answers

If your keys and values are non-overlapping, one obvious approach is to simply store them in the same dict. ie:

class BidirectionalDict(dict):
    def __setitem__(self, key, val):
        dict.__setitem__(self, key, val)
        dict.__setitem__(self, val, key)

    def __delitem__(self, key):
        dict.__delitem__(self, self[key])
        dict.__delitem__(self, key)

d = BidirectionalDict()
d['foo'] = 4
print d[4]   # Prints 'foo'

(You'll also probably want to implement things like the __init__, update and iter* methods to act like a real dict, depending on how much functionality you need).

This should only involve one lookup, though may not save you much in memory (you still have twice the number of dict entries after all). Note however that neither this nor your original will use up twice as much space: the dict only takes up space for the references (effectively pointers), plus an overallocation overhead. The space taken up by your data itself will not be repeated twice since the same objects are pointed to.

like image 108
Brian Avatar answered Nov 01 '22 07:11

Brian


In The Art of Computer Programming, Vokume 3 Knuth has a section on lookups of secondary keys. For purposes of your question, the value could be considered the secondary key.

The first suggestion is to do what you have done: make an efficient index of the keys by value.

The second suggestion is to setup a large btree that is a composite index of the clustered data, where the branch nodes contain values and the leaves contain the key data and pointers to the larger record (if there is one.)

If the data is geometric (as yours appears to be) there are things called post-office trees. It can answer questions like, what is the nearest object to point x. A few examples are here: http://simsearch.yury.name/russir/01nncourse-hand.pdf Another simple option for this kind of query is the quadtree and the k-d tree. http://en.wikipedia.org/wiki/Quadtree

Another final option is combinatorial hashing, where you combine the key and value into a special kind of hash that lets you do efficient lookups on the hash, even when you don't have both values. I couldn't find a good combinatorial hash explanation online, but it is in TAoCP, Volume 3 Second Edition on page 573.

Granted, for some of these you may have to write your own code. But if memory or performance is really key, you might want to take the time.

like image 32
Christopher Avatar answered Nov 01 '22 08:11

Christopher