Efficiently count word frequencies in python

I'd like to count frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa':1, 'bbb': 2, 'ccc':1} if the target text file is like:

# test.txt
aaa bbb ccc bbb

I've implemented it in pure Python following some posts. However, I've found that pure-Python approaches are too slow because the file is huge (> 1 GB).

I think borrowing sklearn's power is one option.

If you let CountVectorizer count frequencies for each line, I guess you can get the word frequencies by summing up each column. But that sounds like a rather indirect way.

What is the most efficient and straightforward way to count words in a file with Python?

Update

My (very slow) code is here:

import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            # Python 2 form of str.translate; on Python 3 use
            # line.translate(str.maketrans('', '', string.punctuation))
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }

814

asked Mar 08 '16 01:03

Light Yagmi

2 Answers

The most succinct approach is to use the tools Python gives you.

from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) makes a lazy iterator that yields the list of words from each line. Wrapping it in chain.from_iterable converts that to a single iterator that produces one word at a time. Counter takes an input iterable and counts all unique values in it. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts; during creation, you only hold one line of data at a time plus the running counts, not the whole file at once.
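To make the three stages concrete, here is a minimal sketch that runs the same pipeline on an in-memory list. The `sample_lines` list is a hypothetical stand-in for the open file object, and the intermediate variable names are only for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample lines standing in for an open file object.
sample_lines = ["aaa bbb ccc\n", "bbb\n"]

# Stage 1: lazily yield one list of words per line.
per_line_words = map(str.split, sample_lines)

# Stage 2: flatten those lists into a single stream of words.
word_stream = chain.from_iterable(per_line_words)

# Stage 3: consume the stream and tally each distinct word.
word_counts = Counter(word_stream)
print(word_counts)  # Counter({'bbb': 2, 'aaa': 1, 'ccc': 1})
```

Nothing is read from `sample_lines` until Counter starts pulling words, which is why the file version never holds more than one line at a time.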

In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher Counter has a C level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
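For comparison, the manual loop mentioned above might look roughly like the following sketch. The function name `count_words_manually` is hypothetical, and the function takes any iterable of lines (such as an open file) so it can be tried without a file on disk.

```python
from collections import defaultdict
from itertools import chain

def count_words_manually(lines):
    """Count words by looping over the chained results ourselves."""
    counts = defaultdict(int)  # missing words start at 0
    for word in chain.from_iterable(map(str.split, lines)):
        counts[word] += 1
    return counts

# Hypothetical sample input standing in for a file.
manual_counts = count_words_manually(["aaa bbb ccc\n", "bbb\n"])
print(dict(manual_counts))
```

This only has a chance of paying off on interpreters where Counter has no C accelerator; elsewhere the Counter version is likely both shorter and faster.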

Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:

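from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        # Python 2 form of str.translate; on Python 3 use
        # line.translate(str.maketrans('', '', punctuation)) instead
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))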
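The two-argument `translate(None, punctuation)` call only works on Python 2 byte strings. On Python 3, one way to get the same effect is a deletion table built with `str.maketrans`. The sketch below is an adaptation, not the answer's original code: the helper `count_normalized_words` and the `encoding` parameter are assumptions added for illustration.

```python
from collections import Counter
from itertools import chain
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans('', '', punctuation)

def count_normalized_words(lines):
    """Count lowercased, punctuation-free words from an iterable of lines."""
    linewords = (line.translate(_STRIP_PUNCTUATION).lower().split()
                 for line in lines)
    return Counter(chain.from_iterable(linewords))

def countInFile(filename, encoding='utf-8'):
    # The encoding default is an assumption; adjust it to match the file.
    with open(filename, encoding=encoding) as f:
        return count_normalized_words(f)
```

For example, `count_normalized_words(["Aaa, bbb! ccc.\n", "BBB\n"])` counts `bbb` twice and `aaa` and `ccc` once each. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes and similar characters are left in place.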

Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
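As a rough illustration of the single-Counter alternative described above, the sketch below keeps the per-line loop but updates one Counter in place instead of building and merging dicts. The name `count_with_update` is hypothetical, and the Python 3 `str.maketrans` form is used for punctuation stripping.

```python
from collections import Counter
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans('', '', punctuation)

def count_with_update(lines):
    """Update one shared Counter once per line."""
    wordcount = Counter()
    for line in lines:
        words = line.translate(_STRIP_PUNCTUATION).lower().split()
        wordcount.update(words)  # adds to existing counts in place
    return wordcount

# Hypothetical sample input standing in for a file.
updated_counts = count_with_update(["Aaa bbb, ccc\n", "bbb.\n"])
print(updated_counts)
```

The work per line is proportional to the number of words on that line, whereas rebuilding the merged dict on every line touches every key seen so far.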

109

answered Oct 01 '22 12:10

ShadowRanger

A memory-efficient and accurate way is to make use of:

CountVectorizer in scikit (for ngram extraction)
NLTK for word_tokenize
numpy matrix sum to collect the counts
collections.Counter for collecting the counts and vocabulary

An example:

import urllib.request
from collections import Counter

import numpy as np

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where each row represents a sentence and each column holds the count of one token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead)
vocab = ngram_vectorizer.get_feature_names_out().tolist()

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print(freq_distribution.most_common(10))

[out]:

[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:

from collections import Counter
import numpy as np

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    # get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
    vocab = ngram_vectorizer.get_feature_names_out().tolist()
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's time it:

import time

start = time.time()
word_distribution = freq_dist(data)
print(time.time() - start)

[out]:

5.257147789001465

Note that CountVectorizer can also take an open file object (an iterable of lines) instead of a list of strings, so there's no need to read the whole file into memory. In code:

import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    # get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
    vocab = ngram_vectorizer.get_feature_names_out().tolist()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print(freq_distribution.most_common(10))

answered Oct 01 '22 12:10

alvas

Tags:

python

nlp

word-count

scikit-learn

frequency-distribution