Count occurrences of a couple of specific words

Tags:

python

I have a list of words, let's say ["foo", "bar", "baz"], and a large string in which these words may occur.

For every word in the list, I currently call "string".count("word"). This works OK but seems rather inefficient: for every extra word added to the list, the entire string must be iterated over an extra time.

Is there a better way to do this, or should I implement a custom method that iterates over the large string a single time, checking at each character whether one of the words in the list has been reached?

To be clear:

I want the number of occurrences per word in the list.
The string to search in is different each time and consists of about 10,000 characters
The list of words is constant
The words in the list of words can contain whitespace
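
For reference, the current approach described above looks roughly like the following sketch. The names `words` and `text` are placeholders, and the short string stands in for the real input of about 10,000 characters. Note that `str.count` counts non-overlapping substring matches, so it would also count "foo" inside a longer word such as "somefoothing".

```python
# Baseline: one str.count() call, and so one scan of the text, per word.
words = ["foo", "bar", "baz"]              # the constant word list
text = "foo bar baz bar quux foo bla bla"  # placeholder for the large string

# str.count counts non-overlapping substring occurrences.
counts = {word: text.count(word) for word in words}
print(counts)
```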

467

asked Feb 29 '12 11:02

zeebonk

2 Answers

Make a dict-typed frequency table for your words, then iterate over the words in your string.

import re

vocab = ["foo", "bar", "baz"]
s = "foo bar baz bar quux foo bla bla"

# Start every vocabulary word at zero, then count matches in a single pass
wordcount = dict((x, 0) for x in vocab)
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1

Edit: if the "words" in your list contain whitespace, you can instead build an RE out of them:

import re
from collections import Counter

vocab = ["foo bar", "baz"]
s = "foo bar baz bar quux foo bla bla"
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, s))

Explanation: this builds the RE r'\bfoo bar\b|\bbaz\b' from the vocabulary. findall then finds the list ['foo bar', 'baz'] and the Counter (Python 2.7+) counts the occurrences of each distinct element in it. Watch out: your list of words should not contain characters that are special to REs, such as ()[]\, unless you escape them first, for example with re.escape.
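
If the vocabulary may contain RE metacharacters, one possible variation is to escape each entry before joining. The sketch below is an illustration under assumptions, not part of the original answer: `count_phrases` is a hypothetical helper name, and sorting longer entries first is an extra precaution so that a phrase such as "foo bar" is tried before a shorter entry that starts at the same position.

```python
import re
from collections import Counter

def count_phrases(vocab, text):
    """Count non-overlapping occurrences of each vocab entry in text."""
    # Try longer entries first so "foo bar" is preferred over a bare "foo".
    ordered = sorted(vocab, key=len, reverse=True)
    # re.escape neutralises characters such as . ( ) [ ] \ in the entries.
    pattern = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in ordered))
    found = Counter(pattern.findall(text))
    # Report every vocab entry, including those that never occur.
    return {w: found[w] for w in vocab}

result = count_phrases(["foo bar", "baz", "a.b"], "foo bar baz bar a.b axb foo bar")
print(result)
```

Because the escaped dot only matches a literal ".", the "axb" in the sample text is not counted. The \b anchors still assume that each entry starts and ends with a word character.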

135

answered Sep 22 '22 01:09

Fred Foo

Presuming the words need to be found separately (that is, you want to count words as made by str.split()):

Edit: as suggested in the comments, a Counter is a good option here:

from collections import Counter

def count_many(needles, haystack):
    count = Counter(haystack.split())
    return {key: count[key] for key in count if key in needles}

Which runs like so:

count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
{'baz': 1, 'foo': 4, 'bar': 4}

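Note that dict comprehensions require Python 2.7 or later; on older versions you would need return dict((key, count[key]) for key in count if key in needles) instead. collections.Counter was itself added in Python 2.7, so on 2.6 and earlier the defaultdict version below is the one to use.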

Of course, another option is to simply return the whole Counter object and only get the values you need when you need them, as it may not be a problem to have the extra values, depending on the situation.
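
A minimal sketch of that option, reusing the sample string from above: the whole Counter is kept and values are looked up on demand. A Counter returns 0 for keys it has never seen and does not add them, so no filtering step is needed.

```python
from collections import Counter

haystack = "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test"
counts = Counter(haystack.split())

# Look up only the words you care about, when you need them.
print(counts["foo"])   # 4
print(counts["quux"])  # 0: missing keys count as zero and are not inserted
```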

Old answer:

from collections import defaultdict

def count_many(needles, haystack):
    count = defaultdict(int)
    for word in haystack.split():
        if word in needles:
            count[word] += 1
    return count

Which results in:

count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
defaultdict(<class 'int'>, {'baz': 1, 'foo': 4, 'bar': 4})

If you greatly object to getting a defaultdict back (which you shouldn't, as it behaves like a dict for keys that exist; note only that looking up a missing key returns 0 and adds that key), then you can do return dict(count) instead to get a normal dictionary.
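
One difference worth knowing when choosing between the two return types is how missing keys behave. The short sketch below uses a made-up three-word input to show it: indexing a defaultdict with an unknown key returns the default and stores the key, while the plain dict produced by dict(count) is unaffected.

```python
from collections import defaultdict

count = defaultdict(int)
for word in "foo bar foo".split():
    count[word] += 1

plain = dict(count)        # ordinary dict holding the same items
print(plain)               # {'foo': 2, 'bar': 1}

print(count["missing"])    # 0, and "missing" is now a key in count
print("missing" in count)  # True
print("missing" in plain)  # False
```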

answered Sep 18 '22 01:09

Gareth Latty
