Counting duplicate words in Python the fastest way

I was trying to count duplicate words in a list of 230 thousand words. I used a Python dictionary to do so. The code is given below:

word_dict = {}
for words in word_list:
    if words in word_dict.keys():
        word_dict[words] += 1
    else:
        word_dict[words] = 1

The above code took 3 minutes! I ran the same code over 1.5 million words; it was still running after more than 25 minutes, so I lost patience and terminated it. Then I found that I could use Counter from collections (shown below). The result was surprising: it completed within seconds! So my question is: what is the fastest way to do this operation? I guess building the dictionary must take O(N) time. How was the Counter method able to complete in seconds and still create an exact dictionary with each word as the key and its frequency as the value?

from collections import Counter
word_dict = Counter(word_list)

asked Jan 17 '13 08:01

Rkz

2 Answers

It's because of this:

if words in word_dict.keys():

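In Python 2, .keys() returns a new list of all the keys every time it is called. Lists take linear time to scan, so each membership test costs O(n) and your program was running in quadratic time overall! (In Python 3, .keys() returns a view with fast membership tests, but calling it here is still unnecessary.)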
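To get a feel for the gap, here is a rough sketch that times a membership test against a list of keys versus the dictionary itself. The sample data and sizes are made up for illustration, and `list(word_dict.keys())` stands in for what Python 2's .keys() would hand back; actual timings will vary by machine.

```python
import timeit

# Hypothetical sample data: 20,000 distinct "words".
words = ["w%d" % i for i in range(20000)]
word_dict = dict.fromkeys(words, 1)

# Stand-in for the list that Python 2's .keys() builds.
key_list = list(word_dict.keys())

# Worst case for the list scan: the key sits at the very end.
probe = words[-1]

list_time = timeit.timeit(lambda: probe in key_list, number=200)
dict_time = timeit.timeit(lambda: probe in word_dict, number=200)

print("list scan:   %.6f s" % list_time)
print("dict lookup: %.6f s" % dict_time)
```

The list scan grows with the number of keys, while the dictionary lookup stays roughly constant, which is where the quadratic versus linear difference comes from.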

Try this instead:

if words in word_dict:
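Put together, the corrected loop might look like the sketch below. The function name `count_words` and the sample list are only illustrative; the one change that matters is testing membership on the dictionary directly.

```python
from collections import Counter


def count_words(word_list):
    """Count occurrences using a plain dict and a direct membership test."""
    word_dict = {}
    for word in word_list:
        if word in word_dict:      # hash lookup on the dict, no list scan
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict


sample = ["a", "b", "a", "c", "b", "a"]  # hypothetical input
counts = count_words(sample)
print(counts)                            # {'a': 3, 'b': 2, 'c': 1}
print(counts == dict(Counter(sample)))   # True
```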

Also, if you're interested, you can see the Counter implementation for yourself. It's written in regular Python.
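As a rough idea of what the tallying boils down to, here is a simplified sketch loosely modeled on the pure-Python counting loop in the standard library. It is not the actual source: the real Counter does more, and CPython may swap in a C helper for this step. The name `count_elements` is just for illustration.

```python
def count_elements(mapping, iterable):
    """Tally items from iterable into mapping (simplified illustration)."""
    mapping_get = mapping.get            # bind the method once, outside the loop
    for elem in iterable:
        # One lookup with a default of 0 replaces the if/else membership test.
        mapping[elem] = mapping_get(elem, 0) + 1


tally = {}
count_elements(tally, ["x", "y", "x"])
print(tally)  # {'x': 2, 'y': 1}
```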

answered Oct 05 '22 05:10

Eevee

Your dictionary counting method is not well constructed.

You could have used a defaultdict in the following way:

from collections import defaultdict

d = defaultdict(int)

for word in word_list:
    d[word] += 1

But Counter from collections is still faster, even though it does almost the same thing, because its implementation is more efficient. However, with Counter you typically pass it a single iterable to count, whereas with a defaultdict you can pull words from different sources and use a more complicated loop.

Ultimately it is your preference. If you are counting a single list, Counter is the way to go. If you are iterating over multiple sources, or you simply want a counter in your program without the extra lookup to check whether an item is already being counted, then defaultdict is your choice.
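As an illustration of the multiple-sources case, here is a small sketch. The `titles` and `bodies` data and the length filter are made up for the example. It also shows that Counter has an update() method, so it can be fed incrementally as well; which one reads better is a matter of taste.

```python
from collections import Counter, defaultdict

# Hypothetical word sources.
titles = ["spam", "eggs"]
bodies = [["spam", "ham"], ["eggs", "spam"]]

# defaultdict: missing keys start at int() == 0, so no membership check is needed.
d = defaultdict(int)
for word in titles:
    d[word] += 1
for body in bodies:
    for word in body:
        if len(word) > 3:        # arbitrary extra logic inside the loop
            d[word] += 1

# Counter fed from the same sources through update().
c = Counter(titles)
for body in bodies:
    c.update(word for word in body if len(word) > 3)

print(dict(d))  # {'spam': 3, 'eggs': 2}
print(dict(c))  # {'spam': 3, 'eggs': 2}
```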

answered Oct 05 '22 05:10

Inbar Rose


Tags:

performance

python

dictionary

hashtable

word-count
