I was trying to count duplicate words over a list of 230 thousand words.I used python dictionary to do so. The code is given below:
for words in word_list:
if words in word_dict.keys():
word_dict[words] += 1
else:
word_dict[words] = 1
The above code took 3 minutes!. I ran the same code over 1.5 million words and it was running for more than 25 minutes and I lost my patience and terminated. Then I found that I can use the following code from here (also shown below). The result was so surprising, it completed within seconds!. So my question is what is the faster way to do this operation?. I guess the dictionary creation process must be taking O(N) time. How was the Counter method able to complete this process in seconds and create an exact dictionary of word as key and frequency as it's value?
from collections import Counter
word_dict = Counter(word_list)
Python Code:def word_count(str): counts = dict() words = str. split() for word in words: if word in counts: counts[word] += 1 else: counts[word] = 1 return counts print( word_count('the quick brown fox jumps over the lazy dog. '))
By using the counter() method we can easily count the duplicate values in the dictionary. This method is a part of a collection module and it is used for counting the hashable objects.
The straight answer is NO. You can not have duplicate keys in a dictionary in Python.
It's because of this:
if words in word_dict.keys():
.keys()
returns a list of all the keys. Lists take linear time to scan, so your program was running in quadratic time!
Try this instead:
if words in word_dict:
Also, if you're interested, you can see the Counter
implementation for yourself. It's written in regular Python.
your dictionary counting method is not well constructed.
you could have used a defaultdict
in the following way:
d = defaultdict(int)
for word in word_list:
d[word] += 1
but the counter
method from itertools is still faster even though it is doing almost the same thing, because it is written in a more efficient implementation. however, with the counter method, you need to pass it a list to count, whereas using a defaultdict, you can put sources from different locations and have a more complicated loop.
ultimately it is your preference. if counting a list, counter
is the way to go, if iterating from multiple sources, or you simply want a counter in your program and dont want the extra lookup to check if an item is already being counted or not. then defaultdict
is your choice.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With