I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus (a list of strings) and comparing their efficiency. As far as I can tell, in the normal case of an unsorted list, the Counter method from the collections module is superior to anything I've come up with or found elsewhere, but it doesn't seem to take much advantage of a pre-sorted list, and I've come up with methods that beat it easily in this special case. So, in short: is there any built-in way to inform Counter that a list is already sorted, to speed it up further?
(The next section covers unsorted lists, where Counter works magic; you may want to skip towards the end, where it loses its charm when dealing with sorted lists.)
The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that one reliably gets you into runtime problems as soon as your corpus is a few thousand items long, which is not surprising, since you're running through the entire list of n words m times, where m is the number of unique words.
So what I tried instead, before I found Counter, was to make sure that all searches are strictly local by first sorting the input list (I also have to remove digits and punctuation marks and convert all entries to lowercase to avoid duplicates like 'foo', 'Foo', and 'foo:').
#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk
# nltk corpora come as a class of their own which, as I understand it, presents to the
# outside as a single list but is underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time -- good for memory issues,
# but rather not so for speed, so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words())
import string
drop = string.punctuation+string.digits
def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''Function for extracting word frequencies out of a corpus. Returns an alphabetical
    list of tuples consisting of the words contained in the corpus with their frequencies.
    Default is case-insensitive, but if you need separate entries for upper- and lower-case
    spellings of the same words, set the option Case=True. If your input list is already
    sorted and stripped of punctuation marks/digits and/or all lower case, you can
    accelerate the operation by a factor of 5 or so by declaring so through the options
    "StrippedSorted" and "lower".'''
    # you can ignore the following lines for now; they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus
        else:
            sortedc = sorted([word.replace('--', ' ').strip(drop)
                              for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
        sortedc = sorted([word.lower().replace('--', ' ').strip(drop)
                          for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list for the results:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final count when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs
collections.Counter
This is much better, but still not quite as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):
from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following, which adds negligibly to the runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
On the Gutenberg corpus with approx. 2 million entries, the Counter method is roughly 30% faster on my machine (5 seconds as opposed to 7.2), which is mostly explained by the sorting routine, which eats around 2.1 seconds. (If you don't have, and don't want to install, the nltk package (Natural Language Toolkit), which offers access to this corpus, any other adequately long text, appropriately split into a list of strings at word level, will show you the same.)
With my idiosyncratic method of timing, using a tautology as a conditional to delay execution, this gives us for the Counter method:
import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time()-start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time()-start
...
4.999882936477661
5.191655874252319
(We see that the last step, that of formatting the results as a list of tuples, takes up less than 5% of the total time.)
Compared to my function:
>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time()-start
...
7.261770963668823
Where Counter 'fails'
However, as you may have noticed, my function allows you to specify that the input is already sorted, stripped of punctuational garbage, and converted to lowercase. If we have already created such a converted version of the list for some other operations, using it (and declaring so) can very much speed up the operation of my wordcount5:
>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time()-start
...
0.9050078392028809
Here, we've reduced the runtime by a factor of approx. 8 by not having to sort the corpus and convert its items. Of course the latter also holds when feeding Counter with this new list, so as expected it's a bit faster too, but it doesn't seem to take advantage of the fact that the list is sorted, and now takes twice as long as my function, where it was 30% faster before:
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time()-start
...     strippedgbfreqs = [(word, cnt[word]) for word in cnt]
...     time.time()-start
...
1.9455058574676514
2.0096349716186523
Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word, only then storing the last word with the current state of the counter, and resetting the counter to 0 for the next word - using Counter merely as storage. But the inherent efficiency of the Counter method seems lost, and performance is within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it did when we were processing the raw corpus:
>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount += 1
...             c += 1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time()-start
...     result = sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time()-start
...     return result
...
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912
(The similarity of the results is not really surprising, since we're in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)
Therefore, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted, making it keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching at a hard limit with ~.9 seconds for tallying a list of 2M entries?
There probably isn't a whole lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count - but maybe there's some itertools magic out there that would help to get close to those figures?
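(As a sketch of what such itertools magic might look like on a toy pre-sorted list - toy data of my own invention, not benchmarked against the Gutenberg corpus:)

```python
from itertools import groupby

# toy stand-in for a pre-sorted, pre-stripped corpus
sorted_tokens = ['a', 'a', 'a', 'brief', 'brief', 'corpus']

# groupby() yields runs of identical adjacent items, so on sorted input
# each distinct word shows up in exactly one group
freqs = [(word, sum(1 for _ in group)) for word, group in groupby(sorted_tokens)]
print(freqs)  # [('a', 3), ('brief', 2), ('corpus', 1)]
```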
(Note: I'm a relative novice to Python and to programming in general, so excuse if I've missed something obvious.)
Another thing that makes all of my methods above slow, besides the sorting itself, is converting every single one of the 2M strings to lowercase and stripping it of whatever punctuation or digits it may include. I have tried before to shortcut that by counting the unprocessed strings and only then converting the results, removing duplicates while adding up their counts, but I must have done something wrong, because it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.
If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that between them convert the keys in the output of J.F. Sebastian's winning defaultdict method (of format [("word", int), ("Word", int), ("word2", int), ...]) to lowercase, strip them of punctuation, and collapse the counts for all keys left identical after that operation (code below). The advantage is that we're now handling a list of roughly 50k entries as opposed to the >2M in the corpus. This way I'm now at 1.25 seconds on my machine for going from the corpus (as a list) to a case-insensitive word count ignoring punctuation marks, down from about 4.5 with the Counter method and string conversion as a first step. But maybe there's a dictionary-based method for what I'm doing in sum_sorted() as well?
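A dictionary-based stand-in for sum_sorted() might look like this minimal sketch (the toy results list and the short drop string are made up for illustration; in the real code, drop is string.punctuation + string.digits):

```python
from collections import defaultdict

# hypothetical toy results list in the [(word, count), ...] format described above;
# 'drop' stands in for string.punctuation + string.digits
results = [('Foo', 2), ('foo', 3), ('foo:', 1), ('bar', 4)]
drop = '.,:;!?0123456789'

collapsed = defaultdict(int)
for word, count in results:
    # normalize the key, then add up the counts of keys that collide
    collapsed[word.lower().strip(drop)] += count

freqs = sorted(collapsed.items())
print(freqs)  # [('bar', 4), ('foo', 6)]
```

This sidesteps the explicit index bookkeeping, and works on unsorted result lists too.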
Code:
def striplast(resultlist, lower_or_Case=False):
    """Function for string conversion of the output of any of the `count_words*` methods."""
    if lower_or_Case:
        strippedresult = sorted([(entry[0].strip(drop), entry[1]) for entry in resultlist])
    else:
        strippedresult = sorted([(entry[0].lower().strip(drop), entry[1]) for entry in resultlist])
    strippedresult = sum_sorted(strippedresult)
    return strippedresult
def sum_sorted(inputlist):
    """Function for collapsing the counts of entries left identical by striplast()."""
    ilindex = 0
    freqs = []
    currentword = inputlist[0][0]
    length = len(inputlist)
    while ilindex < length:
        wordcount = 0
        while ilindex < length and inputlist[ilindex][0] == currentword:
            wordcount += inputlist[ilindex][1]
            ilindex += 1
        if currentword not in ["", " "]:
            freqs.append((currentword, wordcount))
        if ilindex < length and inputlist[ilindex][0] > currentword:
            currentword = inputlist[ilindex][0]
    return freqs
def count_words_defaultdict2(words, loc=False):
    """Modified version of J.F. Sebastian's winning method, with a final step collapsing
    the counts for words identical except for punctuation, digits, and case (for the
    latter, unless you specify that you're interested in a case-sensitive count by setting
    l(ower_)o(r_)c(ase) to True) by means of striplast()."""
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    if loc:
        return striplast(sorted(d.items()), lower_or_Case=True)
    else:
        return striplast(sorted(d.items()))
I made some first attempts at using groupby to do the job currently done by sum_sorted() and/or striplast(), but I couldn't quite work out how to trick it into summing entry[1] for a list of entries in count_words' results sorted by entry[0]. The closest I got was:
# "i(n)p(ut)list", toylist for testing purposes:
list(groupby(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist])))
# returns:
[(('a', 1), <itertools._grouper object at 0x1031bb290>), (('a', 2), <itertools._grouper object at 0x1031bb250>), (('a', 3), <itertools._grouper object at 0x1031bb210>), (('a', 5), <itertools._grouper object at 0x1031bb2d0>), (('a', 8), <itertools._grouper object at 0x1031bb310>), (('b', 3), <itertools._grouper object at 0x1031bb350>), (('b', 7), <itertools._grouper object at 0x1031bb390>)]
# So what I used instead for striplast() is based on list comprehension:
list(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist]))
# returns:
[('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
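For the record, the summing that eluded me can apparently be coaxed out of groupby by grouping on entry[0] with a key function and summing entry[1] inside each group - a minimal sketch on the same toy list:

```python
from itertools import groupby
from operator import itemgetter

# toy input in the same (word, count) format as above, sorted by entry[0]
iplist = [('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]

# group on the word (entry[0]) and sum the counts (entry[1]) within each group
summed = [(word, sum(count for _, count in group))
          for word, group in groupby(iplist, key=itemgetter(0))]
print(summed)  # [('a', 19), ('b', 10)]
```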
Given a sorted list of words as you mention, have you tried the traditional Pythonic approach of itertools.groupby
?
from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict( (k, sum(1 for i in v)) for k, v in groupby(some_data) ) # or
count = {k:sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}
To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute the hash of the key and use it to find the item. They even support keys that have no defined order, as long as they are hashable; in other words, Counter can't take advantage of pre-sorted input.
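A quick way to see this order-insensitivity in action (toy data):

```python
from collections import Counter

words = ['b', 'a', 'b', 'c', 'a', 'b']

# Counter hashes each key as it arrives; the input's order is irrelevant
# both to the result and to the work done per lookup
print(Counter(words) == Counter(sorted(words)))  # True
print(Counter(words)['b'])  # 3
```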
The measurements show that sorting the input words takes longer than counting the words with a dictionary-based approach and sorting the result combined:
sorted 3.19
count_words_Counter 2.88
count_words_defaultdict 2.45
count_words_dict 2.58
count_words_groupby 3.44
count_words_groupby_sum 3.52
Also, counting the words in already-sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and is faster than the dict-based approaches.
def count_words_Counter(words):
    return sorted(Counter(words).items())

def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]

def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]

def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())

def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    # (same as `Counter.most_common()`)
    # so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())
To reproduce the results, run this code.
Note: It is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but relative performance is the same:
sorted 1.22
count_words_Counter 0.86
count_words_defaultdict 0.48
count_words_dict 0.54
count_words_groupby 1.49
count_words_groupby_sum 1.55
The results are similar to Python - Is a dictionary slow to find frequency of each character?.
If you want to normalize the words (remove punctuation, make them lowercase, etc); see answers to What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?. Some examples:
from string import ascii_letters, ascii_lowercase, maketrans

def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_translate(s,
        table=maketrans(ascii_letters, ascii_lowercase * 2),
        deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    return s.translate(table, deletechars)

def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    return filter(_letter_set.__contains__, s).lower()
To count and normalize the words simultaneously:
from collections import defaultdict
from itertools import imap

def combine_counts(items):
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()

def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)

def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))

def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
Results
I've updated the benchmark to measure various combinations of count_words*()
and toascii*()
functions (5x4 pairs not shown):
toascii_letter_lower_filter 0.954 usec small
toascii_letter_lower_genexpr 2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate 0.633 usec small
toascii_letter_lower_filter 124 usec random 2000
toascii_letter_lower_genexpr 197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate 7.73 usec random 2000
sorted 1.28 sec
count_words_Counter 941 msec
count_words_defaultdict 501 msec
count_words_dict 571 msec
count_words_groupby 1.56 sec
count_words_groupby_sum 1.64 sec
count_normalize_words 622 msec
normalize_count_words 2.18 sec
The fastest methods:

- normalize words: toascii_letter_lower_translate()
- count words (presorted input): the groupby()-based approach
- count words: count_words_defaultdict()
- it is faster first to count the words and then to normalize them: count_normalize_words()
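A minimal sketch combining these findings - count first, then normalize and merge the far fewer remaining keys - written here in Python 3's str.maketrans spelling for portability (the code above uses Python 2's string.maketrans with translate's deletechars argument):

```python
from collections import defaultdict
from string import ascii_letters, ascii_lowercase, punctuation, digits

# build a table that lowercases ASCII letters and deletes punctuation/digits
# (str.maketrans with a third argument is the Python 3 spelling)
TABLE = str.maketrans(ascii_letters, ascii_lowercase * 2, punctuation + digits)

def count_normalize(words):
    """Count the raw words first, then normalize and merge the collided keys."""
    raw = defaultdict(int)
    for w in words:
        raw[w] += 1
    merged = defaultdict(int)
    for w, n in raw.items():
        merged[w.translate(TABLE)] += n
    return sorted(merged.items())

print(count_normalize(['Foo', 'foo:', 'bar', 'foo']))  # [('bar', 1), ('foo', 3)]
```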
Latest version of the code: count-words-performance.py.