Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient data structure of objects in python for lookup based on any object member variable

I need to store objects that have a number (>2) of integer member variables and do quick look-ups using any member variables as a search key.

For easier illustration let's say the objects are tuples of 3 integers and I need to do quick look-ups using any element of a tuple as a key in a list of such tuples:

collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]

Look-ups would be like:

collection.lookup_by_first_element(1) # Return (1, 200, 9)
collection.lookup_by_second_element(300) # Return (2, 300, 8)
collection.lookup_by_third_element(250) # Return None

I need the lookups to be fast/efficient. My best bet so far is to use an in memory sqlite database with three columns for the three tuple elements and put indexes on all three columns.

A search tree would also do, but those have one key for lookup and I don't see how could I do lookups based on more than one key. How would you do it?

like image 228
kadam Avatar asked May 22 '11 23:05

kadam


2 Answers

This is a simple solution. You could easily put this in a class and provide a neater interface.

>>> from collections import defaultdict
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> keyed_dict = defaultdict(list)
>>> for tup in collection:
...     for i, e in enumerate(tup):
...         keyed_dict[(i, e)].append(tup)
... 
>>> keyed_dict[(1, 300)]
[(2, 300, 8)]

Update:

For what it's worth, the above is way faster than the numpy solution for lookup:

from timeit import timeit

setup_code = '''
import numpy

clen = {0}  # use .format() to fill this value
collection = [(n, (n + 1) * 100, clen - n) for n in xrange(clen)]

npcollection = numpy.array(collection)

def numpy_lookup(collection, column, value):
    if numpy.any(collection[:, column] == value): return collection[collection[:, column] == value]
    return 'None'

keyed_dict = dict()
for tup in collection:
    for i, e in enumerate(tup):
        keyed_dict[i, e] = tup
'''

for e in range(5):
    setup = setup_code.format(str(10 ** e))
    kd_stmt = '[keyed_dict[0, n] for n in range({0})]'.format(str(10 ** e))
    np_stmt = '[numpy_lookup(npcollection, 0, n) for n in range({0})]'.format(str(10 ** e))
    print 'using keyed_dict: ',
    print timeit(stmt=kd_stmt, setup=setup, number=10)
    print 'using numpy_lookup: ',
    print timeit(stmt=np_stmt.format(str(10 ** e)), setup=setup, number=10)

output:

using keyed_dict:  1.09672546387e-05
using numpy_lookup:  0.000250101089478
using keyed_dict:  3.00407409668e-05
using numpy_lookup:  0.00193691253662
using keyed_dict:  0.000190019607544
using numpy_lookup:  0.0199580192566
using keyed_dict:  0.00195384025574
using numpy_lookup:  0.317503929138
using keyed_dict:  0.0319399833679
using numpy_lookup:  15.0127439499
like image 50
senderle Avatar answered Oct 01 '22 11:10

senderle


senderle is right (I upvoted), but I want to elaborate (and more than just in a comment).

Dictionary lookups are O(1) and really fast (basically your key gets turned into a hash that addresses a specific location in memory to immediately retrieve your data from).

In contrast looking for a value by searching through an array is slow at least O(N), so for larger arrays it will take longer to find the right value. (E.g., you have to sift through all N keys, find the right one, and then can return the tuple.)

So if your keys are unique as you say, it makes sense to just create a large dictionary that is indexed based on every key you may use to lookup. Granted is you will have to represent each tuple of m items (m=3 in your case), m times in the dictionary while with a single array it only had to be represented once in the array.

So you want to define a Collection class

class Collection(object):
    def __init__(self, collection):
        self.collection_dict = dict()
        for tup in collection:
            for i, v in enumerate(tup):
                self.collection_dict[(i, v)] = tup
    def lookup_by_first_element(self, e):
        return self.collection_dict.get((0, e), None)
    def lookup_by_second_element(self, e):
        return self.collection_dict.get((1, e), None)
    def lookup_by_third_element(self, e):
        return self.collection_dict.get((2, e), None)


collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]
c = Collection(collection)

The internal c.collection_dict is:

{(0, 1): (1, 200, 9),
 (0, 2): (2, 300, 8),
 (0, 3): (3, 400, 7),
 (1, 200): (1, 200, 9),
 (1, 300): (2, 300, 8),
 (1, 400): (3, 400, 7),
 (2, 7): (3, 400, 7),
 (2, 8): (2, 300, 8),
 (2, 9): (1, 200, 9)}

And your lookups work as you requested

>>> c.lookup_by_first_element(1) == (1, 200, 9)
True
>>> c.lookup_by_second_element(300) == (2, 300, 8)
True
>>> c.lookup_by_third_element(250) is None
True
like image 33
dr jimbob Avatar answered Oct 01 '22 13:10

dr jimbob