I am trying to calculate word frequency for a 1.2 GB text file, which contains around 203 million words, using the following Python code. But it's giving me a memory error. Is there any solution for this?
Here is my code:
import re
# this one in honor of 4th of July, or pick a text file you have!
filename = 'inputfile.txt'
# create list of lower case words, \s+ --> match any whitespace(s)
# you can replace file(filename).read() with a given string
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation marks to be removed
punctuation = re.compile(r'[.?!,":;]')
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1
print 'Unique words:', len(freq_dic)
# create list of (key, val) tuple pairs
freq_list = freq_dic.items()
# sort by key or word
freq_list.sort()
# display result
for word, freq in freq_list:
    print word, freq
And here is the error, I received:
Traceback (most recent call last):
File "count.py", line 6, in <module>
word_list = re.split('\s+', file(filename).read().lower())
File "/usr/lib/python2.7/re.py", line 167, in split
return _compile(pattern, flags).split(string, maxsplit)
MemoryError
The problem begins right here:
file(filename).read()
This reads the whole file into memory as a single string. If you instead process the file line by line or chunk by chunk, you won't run into a memory problem:
with open(filename) as f:
for line in f:
You could also benefit from using a collections.Counter to count the word frequencies.
In [1]: import collections
In [2]: freq = collections.Counter()
In [3]: line = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod'
In [4]: freq.update(line.split())
In [5]: freq
Out[5]: Counter({'ipsum': 1, 'amet,': 1, 'do': 1, 'sit': 1, 'eiusmod': 1, 'consectetur': 1, 'sed': 1, 'elit,': 1, 'dolor': 1, 'Lorem': 1, 'adipisicing': 1})
And to count some more words,
In [6]: freq.update(line.split())
In [7]: freq
Out[7]: Counter({'ipsum': 2, 'amet,': 2, 'do': 2, 'sit': 2, 'eiusmod': 2, 'consectetur': 2, 'sed': 2, 'elit,': 2, 'dolor': 2, 'Lorem': 2, 'adipisicing': 2})
A collections.Counter is a subclass of dict, so you can use it in ways with which you are already familiar. In addition, it has some useful methods for counting, such as most_common.
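Putting the two ideas together, a memory-friendly rewrite of the original script might look like this (a sketch in Python 3 syntax; the count_words helper and its name are my own, not from the original code):

```python
import collections
import re

# Punctuation to strip, mirroring the pattern in the question's script.
PUNCTUATION = re.compile(r'[.?!,":;]')

def count_words(lines):
    """Count lower-cased, punctuation-stripped words from an iterable of lines."""
    freq = collections.Counter()
    for line in lines:
        # Process one line at a time so the whole file is never in memory.
        freq.update(PUNCTUATION.sub('', word) for word in line.lower().split())
    return freq
```

Since a file object is itself an iterable of lines, you can pass it directly: `with open(filename) as f: freq = count_words(f)`, and then `freq.most_common(20)` gives the twenty most frequent words.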
The problem is that you are trying to read the entire file into memory. The solution is to read the file line by line, count the words of each line, and sum the results.
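A minimal sketch of that line-by-line approach with a plain dict (Python 3 syntax; dict.get replaces the try/except from the question, and the word_freq name is illustrative):

```python
def word_freq(lines):
    """Sum word counts line by line instead of loading the whole file at once."""
    freq = {}
    for line in lines:
        for word in line.split():
            # dict.get returns 0 for unseen words, avoiding a KeyError.
            freq[word] = freq.get(word, 0) + 1
    return freq
```

Calling it as `with open(filename) as f: freq = word_freq(f)` keeps only one line (plus the counts) in memory at a time.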