Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

defaultdict vs dict element initialization

I am trying to optimize the performance of a script that looks up similar words in a lexicon for each word given.

Each unique word is to be split into letter n-grams and for each n-gram, the lexicon returns a list of words that contain the same letter n-gram. Each word from this list is then added to a dictionary as a key and it's value is incremented by one. This gives me a dictionary of similar words with corresponding frequency scores.

word_dict = {}
get = word_dict.get
for letter_n_gram in word:
    for entry in lexicon[n_gram]:
        word_dict[entry] = get(entry, 0) + 1

This implementation works, but the script could supposedly run faster by switching the dict for collections.defaultdict.

word_dd = defaultdict(int)
for letter_n_gram in word:
    for entry in lexicon[n_gram]:
        word_dd[entry] += 1

No other code has been changed.

I was under the impression that both code snippets (most importantly the score adding) should work in the exact same way, i.e. if the key exists, increase its value by 1, if it does not exist, create the key and set the value to 1.

After running the new code, however, some of the keys had values of 0, which I find logically impossible.

Is my logic or knowledge of defaultdict functionality flawed? If not, how can any value in word_dd be set to 0?

edit: I am also very sure that no other part of the script skews these results, as I test the dictionary immediately after shown code by using:

for item in word_dd.iteritems():
    if item[1] == 0:
        print "Found zero value element"
        break
like image 743
Deutherius Avatar asked Dec 14 '22 23:12

Deutherius


1 Answers

When you access a key in a defaultdict, if it is not there, it will be created automatically. Since we have int as the default factory function, it creates the key and gives the default value 0.

from collections import defaultdict
d = defaultdict(int)
print d["a"]
# 0
print d
# defaultdict(<type 'int'>, {'a': 0})

So, before accessing the key, you should make sure that it exists in the defaultdict instance, like this

print "a" in d
# False
like image 105
thefourtheye Avatar answered Dec 30 '22 04:12

thefourtheye