I'm learning Python and I've run into a performance issue. For a single dictionary, I want to remove every key that is a substring of another key; keys that are not contained in any other key should stay.
My keys are unique strings, mostly between 3 and 50 characters long. The dictionary I'm working with has 100,000 or more items, so comparing every key against every other key means billions of comparisons. Since this is an O(n^2) problem, should I stop trying to optimize this code, or is there room to make headway here?
A dictionary is preferable, but I'm open to other types.
For example: 'hello' contains 'he' and 'ell'. I want to remove the keys 'he' and 'ell' while keeping 'hello'. I'd like to remove prefixes, suffixes, and substrings that fall in the middle of other keys.
The keys are generated one by one and added to a dictionary; then reduce_dict(dictionary) runs. My assumption is that testing keys as they are added to the dictionary would be as slow as testing them all afterwards, as in the code below.
def reduce_dict(dictionary):
    reduced = dictionary.copy()
    for key in dictionary:
        for key2 in dictionary:
            if key != key2:
                if key2 in key:  # key2 is a substring of key
                    reduced.pop(key2, 0)
    return reduced
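To illustrate with the 'hello' example above (the values here are arbitrary placeholders, since the real data isn't shown):

d = {'hello': 1, 'he': 2, 'ell': 3, 'world': 4}
print(reduce_dict(d))
# {'hello': 1, 'world': 4} -- 'he' and 'ell' are substrings of 'hello'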
Because key-value pairs in dictionaries are objects, you can delete them using the "del" keyword. del removes a key that exists; it raises a KeyError if the key is not present in the dictionary.
One caveat: you cannot delete entries while iterating over the dictionary itself, so first snapshot the keys with list(dict.keys()). During each iteration, check whether the key meets your removal condition (here, whether it is a substring of another key); if it does, issue the del statement to delete it.
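A minimal sketch of that pattern, applied to the substring test from the question (sample data made up):

def remove_substring_keys(dictionary):
    # Snapshot the keys: deleting while iterating over the dict
    # itself would raise a RuntimeError.
    for key in list(dictionary.keys()):
        # Drop `key` if it is a substring of any other remaining key.
        if any(key != other and key in other for other in dictionary):
            del dictionary[key]
    return dictionary

print(remove_substring_keys({'hello': 1, 'he': 2, 'ell': 3, 'world': 4}))
# {'hello': 1, 'world': 4}

Note this is still the same O(n^2) pairwise check as the code in the question; it only avoids the extra copy.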
I think you can create a list of "good" keys (i.e. those that are not substrings of others) in a slightly optimized way:
# keys = yourDict.keys(), e.g.
keys = ['low', 'el', 'helloworld', 'something', 'ellow', 'thing', 'blah', 'thingy']

# flt is [[key, is_substring], ...] sorted by key length, longest first
flt = [[x, 0] for x in sorted(keys, key=len, reverse=True)]
for i in range(len(flt)):
    p = flt[i]
    if p[1]:  # already marked as a substring; its own substrings are covered
        continue
    for j in range(i + 1, len(flt)):  # iterate over shorter strings only
        q = flt[j]
        if not q[1] and q[0] in p[0]:  # not already marked and is a substring
            q[1] = 1  # mark for removal

goodkeys = set(x[0] for x in flt if not x[1])
print(goodkeys)  # e.g. {'helloworld', 'something', 'thingy', 'blah'}
Now the removal is trivial:

newdict = {k: olddict[k] for k in goodkeys}

This is still O(n^2) substring tests in the worst case, but sorting by length and skipping already-marked strings prunes a lot of the work in practice (if q is a substring of p and p is a substring of r, then q is a substring of r, so marked strings never need their own inner loop).
Given that your strings are fairly small, you could store a hash set of all possible substrings for each key. That would enable you, for a given substring, to find all keys with matching substrings in O(N) time; the tradeoff is that you'd be increasing the cost of insertions, since you'd be constructing a set of substrings for each new key.
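A minimal sketch of that idea, assuming a single global index that maps every substring to the stored keys containing it (the class and helper names are made up for illustration):

from collections import defaultdict

def all_substrings(s):
    # Every contiguous substring of s; a string of length L has
    # O(L**2) of them, so up to ~1,275 for a 50-character key.
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

class SubstringFilteredDict:
    """Keeps only keys that are not substrings of other keys."""

    def __init__(self):
        self.data = {}
        self.index = defaultdict(set)  # substring -> stored keys containing it

    def add(self, key, value):
        # If the new key is a substring of an existing (longer) key,
        # drop it (re-adding an existing key is also a no-op here).
        if self.index.get(key):
            return
        subs = all_substrings(key)
        # Existing keys that appear among the new key's substrings must go.
        for old in [s for s in subs if s in self.data and s != key]:
            self._remove(old)
        self.data[key] = value
        for s in subs:
            self.index[s].add(key)

    def _remove(self, key):
        del self.data[key]
        for s in all_substrings(key):
            self.index[s].discard(key)

d = SubstringFilteredDict()
for k in ['he', 'ell', 'hello', 'low']:
    d.add(k, 0)
print(d.data)  # {'hello': 0, 'low': 0}

This trades memory for speed: each insertion costs O(L^2) set operations for a key of length L, but no pass over all the other keys is needed afterwards.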