Python: best/efficient way of finding a list of words in a text?

Question

I have a list of approximately 300 words and a huge amount of text that I want to scan to know how many times each word appears.

I am using the re module from Python:

for word in list_word:
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % word)
    occurrences = search.subn("", text)[1]

Is there a more efficient or more elegant way of doing this?

Adam Schmideg · Accepted Answer

If you have a huge amount of text, I wouldn't use regexps here; I would simply split the text on whitespace and count in a single pass:

# Seed the dictionary with every word you want to count.
words = {"this": 0, "that": 0}
for w in text.split():
    if w in words:
        words[w] += 1

The words dictionary then holds the frequency of each word.
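
One caveat: text.split() leaves punctuation attached to the tokens, so "this," would not match "this". Below is a minimal sketch of the same single-pass idea that tokenizes with a regex first. The helper name count_words and the sample text are made up for illustration, and treating \w+ as "a word" and lower-casing everything are assumptions that may not suit your data (the words in list_word are assumed to be lowercase already).

```python
import re
from collections import Counter

def count_words(text, list_word):
    """Count how often each word in list_word appears in text (single pass)."""
    wanted = set(list_word)  # set membership is O(1) on average
    # \w+ pulls out runs of letters, digits and underscores, dropping punctuation.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t in wanted)
    # Include words that never occur, with a count of 0.
    return {word: counts[word] for word in list_word}

# Hypothetical sample data, only for illustration.
text = "This and that. Is this it? (That, or this.)"
list_word = ["this", "that", "other"]
print(count_words(text, list_word))
```

The text is scanned once regardless of how many words are in the list, which is the main difference from looping over 300 separate regexes.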

jacobangel · Answer

Try stripping all the punctuation from your text and then splitting it on whitespace, so that strippedText is a list of words. Then simply do:

for word in list_word:
    occurrence = strippedText.count(word)

Or, if you're using Python 3.0 or later, I think you could use a dict comprehension:

occurrences = {word: strippedText.count(word) for word in list_word}
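
It matters whether strippedText is a list or a string: list.count matches whole tokens, while str.count matches substrings, so "is" would also be found inside "this". Here is a rough sketch of the strip-then-split step under that reading. The names table and stripped_words and the sample text are invented for the example, and replacing every character in string.punctuation with a space is just one possible way to strip punctuation.

```python
import string

# Hypothetical sample data, only for illustration.
text = "This is it. Is this his thesis? This!"
list_word = ["this", "is", "his"]

# Replace each punctuation character with a space, then split on whitespace.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
stripped_words = text.lower().translate(table).split()

# list.count matches whole tokens only.
occurrences = {word: stripped_words.count(word) for word in list_word}

# str.count matches substrings, so "is" is also found inside "this", "his" and "thesis".
substring_hits = text.lower().count("is")

print(occurrences)
print(substring_hits)
```

Note that each call to count walks the whole list, so with 300 words this makes 300 passes over the text; for very large inputs a single-pass dictionary count is likely to be faster.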
