Optimising partial dictionary key match

I have a dictionary which uses a 4-tuple as its key. I need to find all the keys in the dictionary which partially match some other tuple. I have some code which does this, but it's slow and needs optimising.

Here's what I'm after:

Keys:
(1, 2, 3, 4)
(1, 3, 5, 2)
(2, 4, 8, 7)
(1, 4, 3, 4)
Match:
(1, None, 3, None)
Result:
[(1, 2, 3, 4), (1, 4, 3, 4)]

Current code:

def GetTuples(self, keyWords):
    tuples = []
    for k in self.chain.iterkeys():
        match = True
        for i in range(self.order):
            if keyWords[i] is not None and keyWords[i] != k[i]:
                match = False
                break
        if match:
            tuples.append(k)
    return tuples
  • keyWords is a list containing the values I want to match
  • self.chain is the dictionary
  • self.order is the size of the tuple
  • len(keyWords) is always equal to len(k)
  • 'None' is considered the wild card
  • The dictionary is pretty huge (this method takes ~800 ms to run, and the structure uses about 300 MB), so space is also a consideration

I'm basically looking for either optimizations to this method, or a better way of storing this data.

asked Oct 03 '11 by combatdave


3 Answers

You can't optimize this further if you store your data in a plain dictionary, since it provides nothing faster than sequential access to all of its elements in some unpredictable order. This means your solution cannot be faster than O(n).

Now, databases. A database is not a universal solution to every (sufficiently complex) problem. Can you reliably estimate the speed/complexity of such lookups in a database? If you scroll to the bottom of this answer, you will see that for large data sets database performance can be much worse than a smart data structure.

What you need here is a hand-crafted data structure. There are a number of choices, and the right one depends strongly on what else you are doing with this data. For example: you can keep N sorted lists of your keys, each sorted by the n-th tuple element. Then you can quickly select the sorted subset of elements matching a single tuple element at position n, and intersect the subsets to get the results. This gives an average performance of O(log n)*O(m), where m is the average number of elements in one subset.
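To make that concrete, here is a minimal, self-contained sketch of the sorted-lists approach (the sample data and helper names such as keys_with_value_at are mine, not from the question):

```python
import bisect

keys = [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)]
order = 4

# One sorted list per position: sorted_by[n] holds (value_at_n, key) pairs,
# so every key sharing a value at position n sits in one contiguous run.
sorted_by = [sorted((k[n], k) for k in keys) for n in range(order)]

def keys_with_value_at(n, v):
    """Set of keys whose n-th element equals v, located via binary search."""
    column = sorted_by[n]
    lo = bisect.bisect_left(column, (v,))
    hi = bisect.bisect_right(column, (v, (float('inf'),)))
    return {key for _, key in column[lo:hi]}

def match(pattern):
    result = None
    for n, v in enumerate(pattern):
        if v is None:          # wildcard: no constraint at this position
            continue
        subset = keys_with_value_at(n, v)
        result = subset if result is None else result & subset
    # An all-wildcard pattern matches every key.
    return sorted(result) if result is not None else sorted(keys)

print(match((1, None, 3, None)))   # [(1, 2, 3, 4), (1, 4, 3, 4)]
```

Each binary search is O(log n); the dominant cost is materializing and intersecting the per-position subsets, which is where the O(m) factor comes from.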

Or you can store your items in a k-d tree. This means paying an O(log n) insertion price, but queries like the one above also run in O(log n) time. Here is an example in Python, using the k-d tree implementation from SciPy:

from scipy.spatial import kdtree
import itertools
import random

random.seed(1)
data = list(itertools.permutations(range(10), 4))
random.shuffle(data)
data = data[:(len(data)/2)]

tree = kdtree.KDTree(data)

def match(a, b):
    assert len(a) == len(b)
    for i, v in enumerate(a):
        if v != b[i] and (v is not None) and (b[i] is not None):
            return False
    return True

def find_like(kdtree, needle):
    assert len(needle) == kdtree.m
    def do_find(tree, needle):
        if hasattr(tree, 'idx'):
            return list(itertools.ifilter(lambda x: match(needle, x),
                                          kdtree.data[tree.idx]))
        if needle[tree.split_dim] is None:
            return do_find(tree.less, needle) + do_find(tree.greater, needle)
        if needle[tree.split_dim] <= tree.split:
            return do_find(tree.less, needle)
        else:
            return do_find(tree.greater, needle)
    return do_find(kdtree.tree, needle)

def find_like_bf(kdtree, needle):
    assert len(needle) == kdtree.m
    return list(itertools.ifilter(lambda x: match(needle, x),
                                  kdtree.data))

import timeit
print "k-d tree:"
print "%.2f sec" % timeit.timeit("find_like(tree, (1, None, 2, None))",
                                "from __main__ import find_like, tree",
                                number=1000)
print "brute force:"
print "%.2f sec" % timeit.timeit("find_like_bf(tree, (1, None, 2, None))",
                                "from __main__ import find_like_bf, tree",
                                number=1000)

And test run results:

$ python lookup.py
k-d tree:
0.89 sec
brute force:
6.92 sec

Just for fun, I also added a database-based benchmark. The initialization code changes from the above to:

random.seed(1)
data = list(itertools.permutations(range(30), 4))
random.shuffle(data)

Now, the "database" implementation:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE a (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
db.execute("CREATE INDEX x1 ON a(x1)")
db.execute("CREATE INDEX x2 ON a(x2)")
db.execute("CREATE INDEX x3 ON a(x3)")
db.execute("CREATE INDEX x4 ON a(x4)")

db.executemany("INSERT INTO a VALUES (?, ?, ?, ?)",
               [[int(x) for x in value] for value in tree.data])

def db_test():
    cur = db.cursor()
    cur.execute("SELECT * FROM a WHERE x1=? AND x3=?", (1, 2))
    return cur.fetchall()

print "sqlite db:"
print "%.2f sec" % timeit.timeit("db_test()",
                                 "from __main__ import db_test",
                                 number=100)

And the test results, reduced to 100 runs per benchmark (for the resulting 657720-element set of keys):

$ python lookup.py
building tree
done in 6.97 sec
building db
done in 11.59 sec
k-d tree:
1.90 sec
sqlite db:
2.31 sec

It is also worth mentioning that building the tree took considerably less time (6.97 sec) than inserting the same test data set into the database (11.59 sec).

Complete source here: https://gist.github.com/1261449

abbot answered Sep 21 '22

What about just using a database?

I prefer SQLite + SQLAlchemy even for simple projects, but plain sqlite3 might have a gentler learning curve.

Putting an index on each key column should take care of speed issues.
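A minimal sketch of that idea with the standard-library sqlite3 module (table and column names are mine), building the WHERE clause from the non-wildcard positions only:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tuples (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
for col in ("x1", "x2", "x3", "x4"):
    db.execute("CREATE INDEX idx_%s ON tuples(%s)" % (col, col))

db.executemany("INSERT INTO tuples VALUES (?, ?, ?, ?)",
               [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)])

def find(pattern):
    """None acts as the wildcard: only non-None positions become WHERE terms."""
    clauses, params = [], []
    for col, value in zip(("x1", "x2", "x3", "x4"), pattern):
        if value is not None:
            clauses.append("%s = ?" % col)
            params.append(value)
    sql = "SELECT x1, x2, x3, x4 FROM tuples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sorted(tuple(row) for row in db.execute(sql, params))

print(find((1, None, 3, None)))    # [(1, 2, 3, 4), (1, 4, 3, 4)]
```

The query planner can then pick whichever indexed column is most selective for a given pattern.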

Petr Viktorin answered Sep 23 '22


Perhaps you could speed it up by maintaining indexes for your keys. Essentially, something like this:

self.indices[2][5]

would contain a set of all of the keys which have 5 in the third position of the key.

Then you could simply do set intersection between the relevant index entries to get the set of keys:

matching_keys = None

for i in range(self.order):
    if keyWords[i] is not None:
        if matching_keys is None:
            matching_keys = self.indices[i][keyWords[i]]
        else:
            matching_keys &= self.indices[i][keyWords[i]]

# An all-wildcard keyWords leaves matching_keys as None; fall back to all keys.
matching_keys = list(matching_keys) if matching_keys is not None else list(self.chain)
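Building those indices is a one-time pass over the keys. A standalone sketch (function names are mine, using plain sets rather than methods on the questioner's class):

```python
from collections import defaultdict

def build_indices(keys, order):
    """indices[i][v] is the set of keys whose i-th element equals v."""
    indices = [defaultdict(set) for _ in range(order)]
    for k in keys:
        for i, v in enumerate(k):
            indices[i][v].add(k)
    return indices

keys = [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)]
indices = build_indices(keys, 4)

# Intersecting the index entries for the non-wildcard positions of
# (1, None, 3, None) yields the matching keys directly.
print(sorted(indices[0][1] & indices[2][3]))   # [(1, 2, 3, 4), (1, 4, 3, 4)]
```

The indices roughly double the memory spent on keys, so this trades space for lookup speed.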
Amber answered Sep 23 '22