Optimising partial dictionary key match

I have a dictionary which uses a 4-tuple as its key. I need to find all the keys in the dictionary which partially match some other tuple. I have some code which does this, but it's slow and needs optimising.

Here's what I'm after:

Keys:
(1, 2, 3, 4)
(1, 3, 5, 2)
(2, 4, 8, 7)
(1, 4, 3, 4)
Match:
(1, None, 3, None)
Result:
[(1, 2, 3, 4), (1, 4, 3, 4)]

Current code:

def GetTuples(self, keyWords):
    tuples = []
    for k in self.chain.iterkeys():
        match = True
        for i in range(self.order):
            if keyWords[i] is not None and keyWords[i] != k[i]:
                match = False
                break
        if match:
            tuples.append(k)
    return tuples
  • keyWords is a list containing the values I want to match
  • self.chain is the dictionary
  • self.order is the size of the tuple
  • len(keyWords) is always equal to len(k)
  • 'None' is considered the wild card
  • The dictionary is pretty huge (this method takes ~800 ms to run, and the structure uses about 300 MB), so space is also a consideration

I'm basically looking for either optimizations to this method, or a better way of storing this data.

asked Oct 03 '11 by combatdave


3 Answers

You can't optimize this further if you store your data in a plain dictionary, since it provides nothing faster than sequential access to all of its elements in some unpredictable order. This means your solution cannot be faster than O(n).

Now, databases. A database is not a universal solution to every (sufficiently complex) problem. Can you reliably estimate the speed/complexity of such lookups in a database? If you scroll to the bottom of this answer, you will see that for large data sets database performance can be much worse than a smart data structure.

What you need here is a hand-crafted data structure. There are a number of choices, and the right one depends strongly on what else you are doing with this data. For example: you can keep N sorted lists of your keys, each sorted by the n-th tuple element. Then you can quickly select the sorted subset of elements matching a single tuple element at position n, and intersect the subsets to get the results. This gives an average performance of O(log n)*O(m), where m is the average number of elements in one subset.
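To make that concrete, here is a minimal, self-contained sketch of the sorted-lists approach (the sample data and helper names such as keys_with_value_at are mine, not from the question):

```python
import bisect

keys = [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)]
order = 4

# One sorted list per position: sorted_by[n] holds (value_at_n, key) pairs,
# so every key sharing a value at position n sits in one contiguous run.
sorted_by = [sorted((k[n], k) for k in keys) for n in range(order)]

def keys_with_value_at(n, v):
    """Set of keys whose n-th element equals v, located via binary search."""
    column = sorted_by[n]
    lo = bisect.bisect_left(column, (v,))
    hi = bisect.bisect_right(column, (v, (float('inf'),)))
    return {key for _, key in column[lo:hi]}

def match(pattern):
    result = None
    for n, v in enumerate(pattern):
        if v is None:          # wildcard: no constraint at this position
            continue
        subset = keys_with_value_at(n, v)
        result = subset if result is None else result & subset
    # An all-wildcard pattern matches every key.
    return sorted(result) if result is not None else sorted(keys)

print(match((1, None, 3, None)))   # [(1, 2, 3, 4), (1, 4, 3, 4)]
```

Each binary search is O(log n); the dominant cost is materializing and intersecting the per-position subsets, which is where the O(m) factor comes from.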

Or you can store your items in a k-d tree. This means paying an O(log n) insertion price, but queries like the one above also run in O(log n) time. Here is an example in Python, using the k-d tree implementation from SciPy:

from scipy.spatial import kdtree
import itertools
import random

random.seed(1)
data = list(itertools.permutations(range(10), 4))
random.shuffle(data)
data = data[:(len(data)/2)]

tree = kdtree.KDTree(data)

def match(a, b):
    assert len(a) == len(b)
    for i, v in enumerate(a):
        if v != b[i] and (v is not None) and (b[i] is not None):
            return False
    return True

def find_like(kdtree, needle):
    assert len(needle) == kdtree.m
    def do_find(tree, needle):
        if hasattr(tree, 'idx'):
            return list(itertools.ifilter(lambda x: match(needle, x),
                                          kdtree.data[tree.idx]))
        if needle[tree.split_dim] is None:
            return do_find(tree.less, needle) + do_find(tree.greater, needle)
        if needle[tree.split_dim] <= tree.split:
            return do_find(tree.less, needle)
        else:
            return do_find(tree.greater, needle)
    return do_find(kdtree.tree, needle)

def find_like_bf(kdtree, needle):
    assert len(needle) == kdtree.m
    return list(itertools.ifilter(lambda x: match(needle, x),
                                  kdtree.data))

import timeit
print "k-d tree:"
print "%.2f sec" % timeit.timeit("find_like(tree, (1, None, 2, None))",
                                "from __main__ import find_like, tree",
                                number=1000)
print "brute force:"
print "%.2f sec" % timeit.timeit("find_like_bf(tree, (1, None, 2, None))",
                                "from __main__ import find_like_bf, tree",
                                number=1000)

And test run results:

$ python lookup.py
k-d tree:
0.89 sec
brute force:
6.92 sec

Just for fun, I also added a database-based benchmark. The initialization code changes from the above to:

random.seed(1)
data = list(itertools.permutations(range(30), 4))
random.shuffle(data)

Now, the "database" implementation:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE a (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
db.execute("CREATE INDEX x1 ON a(x1)")
db.execute("CREATE INDEX x2 ON a(x2)")
db.execute("CREATE INDEX x3 ON a(x3)")
db.execute("CREATE INDEX x4 ON a(x4)")

db.executemany("INSERT INTO a VALUES (?, ?, ?, ?)",
               [[int(x) for x in value] for value in tree.data])

def db_test():
    cur = db.cursor()
    cur.execute("SELECT * FROM a WHERE x1=? AND x3=?", (1, 2))
    return cur.fetchall()

print "sqlite db:"
print "%.2f sec" % timeit.timeit("db_test()",
                                 "from __main__ import db_test",
                                 number=100)

And the test results, reduced to 100 runs per benchmark (for the resulting 657720-element set of keys):

$ python lookup.py
building tree
done in 6.97 sec
building db
done in 11.59 sec
k-d tree:
1.90 sec
sqlite db:
2.31 sec

It is also worth mentioning that building the tree took considerably less time (6.97 sec) than inserting the same test data set into the database (11.59 sec).

Complete source here: https://gist.github.com/1261449

abbot answered Sep 21 '22

What about just using a database?

I prefer SQLite + SQLAlchemy even for simple projects, but plain sqlite3 might have a gentler learning curve.

Putting an index on each key column should take care of speed issues.
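A minimal sketch of that idea with the standard-library sqlite3 module (table and column names are mine), building the WHERE clause from the non-wildcard positions only:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tuples (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
for col in ("x1", "x2", "x3", "x4"):
    db.execute("CREATE INDEX idx_%s ON tuples(%s)" % (col, col))

db.executemany("INSERT INTO tuples VALUES (?, ?, ?, ?)",
               [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)])

def find(pattern):
    """None acts as the wildcard: only non-None positions become WHERE terms."""
    clauses, params = [], []
    for col, value in zip(("x1", "x2", "x3", "x4"), pattern):
        if value is not None:
            clauses.append("%s = ?" % col)
            params.append(value)
    sql = "SELECT x1, x2, x3, x4 FROM tuples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sorted(tuple(row) for row in db.execute(sql, params))

print(find((1, None, 3, None)))    # [(1, 2, 3, 4), (1, 4, 3, 4)]
```

The query planner can then pick whichever indexed column is most selective for a given pattern.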

Petr Viktorin answered Sep 23 '22


Perhaps you could speed it up by maintaining indexes for your keys. Essentially, something like this:

self.indices[2][5]

would contain a set of all of the keys which have 5 in the third position of the key.

Then you could simply do set intersection between the relevant index entries to get the set of keys:

matching_keys = None

for i in range(self.order):
    if keyWords[i] is not None:
        if matching_keys is None:
            matching_keys = self.indices[i][keyWords[i]]
        else:
            matching_keys &= self.indices[i][keyWords[i]]

# An all-wildcard keyWords leaves matching_keys as None; fall back to all keys.
matching_keys = list(matching_keys) if matching_keys is not None else list(self.chain)
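Building those indices is a one-time pass over the keys. A standalone sketch (function names are mine, using plain sets rather than methods on the questioner's class):

```python
from collections import defaultdict

def build_indices(keys, order):
    """indices[i][v] is the set of keys whose i-th element equals v."""
    indices = [defaultdict(set) for _ in range(order)]
    for k in keys:
        for i, v in enumerate(k):
            indices[i][v].add(k)
    return indices

keys = [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)]
indices = build_indices(keys, 4)

# Intersecting the index entries for the non-wildcard positions of
# (1, None, 3, None) yields the matching keys directly.
print(sorted(indices[0][1] & indices[2][3]))   # [(1, 2, 3, 4), (1, 4, 3, 4)]
```

The indices roughly double the memory spent on keys, so this trades space for lookup speed.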
Amber answered Sep 23 '22