Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting all keys in a dict that overlap with other keys in the same dict

I have a list comprehension that looks like this:

cart = [ ((p,pp),(q,qq)) for ((p,pp),(q,qq))\
         in itertools.product(C.items(), repeat=2)\
         if p[1:] == q[:-1] ]

C is a dict with keys that are tuples of arbitrary integers . All the tuples have the same length. Worst case is that all the combinations should be included in the new list. This can happen quite frequently.

As an example, I have a dictionary like this:

C = { (0,1):'b',
      (2,0):'c',
      (0,0):'d' }

And I want the the result to be:

cart = [ (((2, 0), 'c'), ((0, 1), 'b'))
         (((2, 0), 'c'), ((0, 0), 'd'))
         (((0, 0), 'd'), ((0, 1), 'b'))
         (((0, 0), 'd'), ((0, 0), 'd')) ]

So, by overlap I am referring to, for instance, that the tuples (1,2,3,4) and (2,3,4,5) have the overlapping section (2,3,4). The overlapping sections must be on the "edges" of the tuples. I only want overlaps that have length one shorter than the tuple length. Thus (1,2,3,4) does not overlap with (3,4,5,6). Also note that when removing the first or last element of a tuple we might end up with non-distinct tuples, all of which must be compared to all the other elements. This last point was not emphasized in my first example.

The better part of my codes execution time is spent in this list comprehension. I always need all elements of cart so there appears to be no speedup when using a generator instead.

My question is: Is there a faster way of doing this?

A thought I had was that I could try to create two new dictionaries like this:

aa = defaultdict(list)
bb = defaultdict(list)
[aa[p[1:]].append(p) for p in C.keys()]
[bb[p[:-1]].append(p) for p in C.keys()]

And somehow merge all combinations of elements of the list in aa[i] with the list in bb[i] for all i, but I can not seem to wrap my head around this idea either.

Update

Both the solution added by tobias_k and shx2 have better complexity than my original code (as far as I can tell). My code is O(n^2) whereas the two other solutions are O(n). For my problem size and composition however, all three solutions seem to run at more or less the same time. I suppose this has to do with a combination of overhead associated with function calls, as well as the nature of the data I am working with. In particular the number of different keys, as well as the actual composition of the keys, seem to have a large impact. The latter I know because the code runs much slower for completely random keys. I have accepted tobias_k's answer because his code is the easiest to follow. However, i would still greatly welcome other suggestions on how to perform this task.

like image 216
inconvergent Avatar asked Mar 08 '13 11:03

inconvergent


2 Answers

You were actually on the right track, using the dictionaries to store all the prefixes to the keys. However, keep in mind that (as far as I understand the question) two keys can also overlap if the overlap is less than len-1, e.g. the keys (1,2,3,4) and (3,4,5,6) would overlap, too. Thus we have to create a map holding all the prefixes of the keys. (If I am mistaken about this, just drop the two inner for loops.) Once we have this map, we can iterate over all the keys a second time, and check whether for any of their suffixes there are matching keys in the prefixes map. (Update: Since keys can overlap w.r.t. more than one prefix/suffix, we store the overlapping pairs in a set.)

def get_overlap(keys):
    # create map: prefix -> set(keys with that prefix)
    prefixes = defaultdict(set)
    for key in keys:
        for prefix in [key[:i] for i in range(len(key))]:
            prefixes[prefix].add(key)
    # get keys with matching prefixes for all suffixes
    overlap = set()
    for key in keys:
        for suffix in [key[i:] for i in range(len(key))]:
            overlap.update([(key, other) for other in prefixes[suffix]
                                                      if other != key])
    return overlap

(Note that, for simplicity, I only care about the keys in the dictionary, not the values. Extending this to return the values, too, or doing this as a postprocessing step, should be trivial.)

Overall running time should be only 2*n*k, n being the number of keys and k the length of the keys. Space complexity (the size of the prefixes map) should be between n*k and n^2*k, if there are very many keys with the same prefixes.

Note: The above answer is for the more general case that the overlapping region can have any length. For the simpler case that you consider only overlaps one shorter than the original tuple, the following should suffice and yield the results described in your examples:

def get_overlap_simple(keys):
    prefixes = defaultdict(list)
    for key in keys:
        prefixes[key[:-1]].append(key)
    return [(key, other) for key in keys for other in prefixes[key[1:]]]
like image 60
tobias_k Avatar answered Nov 14 '22 22:11

tobias_k


Your idea of preprocessing the data into a dict was a good one. Here goes:

from itertools import groupby
C = { (0,1): 'b', (2,0): 'c', (0,0): 'd' }

def my_groupby(seq, key):
    """
     >>> group_by(range(10), lambda x: 'mod=%d' % (x % 3))
     {'mod=2': [2, 5, 8], 'mod=0': [0, 3, 6, 9], 'mod=1': [1, 4, 7]}
    """
    groups = dict()
    for x in seq:
        y = key(x)
        groups.setdefault(y, []).append(x)
    return groups

def get_overlapping_items(C):
    prefixes = my_groupby(C.iteritems(), key = lambda (k,v): k[:-1])
    for k1, v1 in C.iteritems():
        prefix = k1[1:]
        for k2, v2 in prefixes.get(prefix, []):
            yield (k1, v1), (k2, v2)

for x in get_overlapping_items(C):
    print x     

(((2, 0), 'c'), ((0, 1), 'b'))
(((2, 0), 'c'), ((0, 0), 'd'))
(((0, 0), 'd'), ((0, 1), 'b'))
(((0, 0), 'd'), ((0, 0), 'd'))

And by the way, instead of:

itertools.product(*[C.items()]*2)

do:

itertools.product(C.items(), repeat=2)
like image 27
shx2 Avatar answered Nov 14 '22 23:11

shx2