Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

counting duplicate words in python the fastest way

I was trying to count duplicate words over a list of 230 thousand words.I used python dictionary to do so. The code is given below:

for words in word_list:
    if words in word_dict.keys():
       word_dict[words] += 1
    else:
       word_dict[words] = 1

The above code took 3 minutes!. I ran the same code over 1.5 million words and it was running for more than 25 minutes and I lost my patience and terminated. Then I found that I can use the following code from here (also shown below). The result was so surprising, it completed within seconds!. So my question is what is the faster way to do this operation?. I guess the dictionary creation process must be taking O(N) time. How was the Counter method able to complete this process in seconds and create an exact dictionary of word as key and frequency as it's value?

from collections import Counter
word_dict = Counter(word_list)
like image 976
Rkz Avatar asked Jan 17 '13 08:01

Rkz


People also ask

How do you count repeated words in Python?

Python Code:def word_count(str): counts = dict() words = str. split() for word in words: if word in counts: counts[word] += 1 else: counts[word] = 1 return counts print( word_count('the quick brown fox jumps over the lazy dog. '))

How do you count duplicates in a dictionary?

By using the counter() method we can easily count the duplicate values in the dictionary. This method is a part of a collection module and it is used for counting the hashable objects.

Can Python dictionaries have duplicate values?

The straight answer is NO. You can not have duplicate keys in a dictionary in Python.


2 Answers

It's because of this:

if words in word_dict.keys():

.keys() returns a list of all the keys. Lists take linear time to scan, so your program was running in quadratic time!

Try this instead:

if words in word_dict:

Also, if you're interested, you can see the Counter implementation for yourself. It's written in regular Python.

like image 130
Eevee Avatar answered Oct 05 '22 05:10

Eevee


your dictionary counting method is not well constructed.

you could have used a defaultdict in the following way:

d = defaultdict(int)

for word in word_list:
    d[word] += 1

but the counter method from itertools is still faster even though it is doing almost the same thing, because it is written in a more efficient implementation. however, with the counter method, you need to pass it a list to count, whereas using a defaultdict, you can put sources from different locations and have a more complicated loop.

ultimately it is your preference. if counting a list, counter is the way to go, if iterating from multiple sources, or you simply want a counter in your program and dont want the extra lookup to check if an item is already being counted or not. then defaultdict is your choice.

like image 44
Inbar Rose Avatar answered Oct 05 '22 05:10

Inbar Rose