Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove a dictionary key if it is a substring in any other key

I'm learning Python. I've got a performance issue. For a single dictionary, I want to remove keys if

  • The a key is a substring in another key

I don't want to remove keys if

  • The key substring is itself

My keys are unique strings mostly between 3-50 characters in length. The dictionary I'm working with has 100,000 or more items, making billions of comparisons. Since this is an O(n^2) problem, should I stop trying to optimize this code? Or is there room to make headway here?

A dictionary is preferable, but I'm open to other types.

For example: 'hello' contains 'he' and 'ell'. I want to remove the keys 'he' and 'ell' while keeping 'hello'. I'd like to remove prefixes, suffixes, and key-substrings in the middle of other keys.

The keys are generated one-by-one and added to a dictionary. Then reduce_dict(dictionary) runs. My assumption is: a test while they're added to a dictionary would be as slow as a function testing after, as in the code below.

def reduce_dict(dictionary):
    reduced = dictionary.copy()
    for key in dictionary:
        for key2 in dictionary:
            if key != key2:
                if key2 in key:
                    reduced.pop(key2, 0)
    return reduced
like image 226
12345678910111213 Avatar asked Aug 21 '14 08:08

12345678910111213


People also ask

How do you remove a key from a dictionary?

Because key-value pairs in dictionaries are objects, you can delete them using the “del” keyword. The “del” keyword is used to delete a key that does exist. It raises a KeyError if a key is not present in a dictionary.

How remove key from dictionary while iterating?

First, you need to convert the dictionary keys to a list using the list(dict. keys()) method. During each iteration, you can check if the value of a key is equal to the desired value. If it is True , you can issue the del statement to delete the key.


2 Answers

I think you can create a list of "good" keys (=those that are not substrings of others) in a slightly optimized way:

# keys = yourDict.keys(), e.g.
keys = ['low', 'el', 'helloworld', 'something', 'ellow', 'thing', 'blah', 'thingy']

# flt is [[key, is_substring],...] sorted by key length reversed
flt = [[x, 0] for x in sorted(keys, key=len, reverse=True)]

for i in range(len(flt)):
    p = flt[i]
    if p[1]:  # already removed
        continue
    for j in range(i + 1, len(flt)): # iterate over shorter strings
        q = flt[j]
        if not q[1] and q[0] in p[0]: # if not already removed and is substring
            q[1] = 1  # remove

goodkeys = set(x[0] for x in flt if not x[1])
print goodkeys # e.g ['helloworld', 'something', 'thingy', 'blah']

Now the removal is trivial:

newdict = {k:olddict[k] for k in goodkeys}
like image 58
georg Avatar answered Oct 17 '22 08:10

georg


Given that your strings are somewhat small, you could store a hashset of all possible substrings for each key. That would enable you to, for a given substring, find all keys that have matching substrings in O(N) time, however the tradeoff is you'd be increasing thr time complexity of your insertions since you'd be constructing a set of substrings for each new key.

like image 28
Sean Azlin Avatar answered Oct 17 '22 07:10

Sean Azlin