Python: best/efficient way of finding a list of words in a text?

Tags: python, regex

I have a list of approximately 300 words and a huge amount of text that I want to scan to know how many times each word appears.

I am using Python's re module:

for word in list_word:
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % word)
    occurrences = search.subn("", text)[1]

Is there a more efficient or more elegant way of doing this?

Mermoz asked Jul 30 '10 14:07



2 Answers

If you have a huge amount of text, I wouldn't use regexps here. I would simply split the text on whitespace and count in a single pass:

# Start every target word at zero; membership tests on a dict are fast.
words = {"this": 0, "that": 0}
for w in text.split():
    if w in words:
        words[w] += 1

After the loop, words holds the number of occurrences of each word.
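A runnable version of this idea might look like the sketch below. The function name count_words and the sample text are made up for illustration. Trimming punctuation with str.strip is an added assumption, not part of the snippet above: plain split() leaves "that," and "that" as different tokens. Matching stays case-sensitive, so "This" is not counted as "this".

```python
import string

def count_words(text, list_word):
    """Count exact-token occurrences of each listed word in one pass."""
    counts = dict.fromkeys(list_word, 0)
    for token in text.split():
        # Assumption: trim leading/trailing punctuation so "that," matches "that".
        token = token.strip(string.punctuation)
        if token in counts:
            counts[token] += 1
    return counts

# Hypothetical sample input, only for demonstration.
sample_text = "This is it, and that is that. (this too)"
print(count_words(sample_text, ["this", "that"]))
```

The text is scanned once regardless of how many words are in the list, which is what makes this cheaper than running 300 separate regex passes.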

Adam Schmideg answered Oct 11 '22 13:10



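Try stripping all the punctuation from your text and then splitting on whitespace, so that strippedText is a list of words (on a plain string, count() would also match substrings such as "the" inside "there"). Then simply do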

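occurrences = {}
for word in list_word:
    # Store each count under its word instead of overwriting a single variable.
    occurrences[word] = strippedText.count(word)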

Or, if you're using Python 2.7 or 3.x, a dict comprehension does the same thing:

occurrences = {word: strippedText.count(word) for word in list_word}
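The stripping step itself isn't shown above, so here is one possible sketch for Python 3. The helper names strip_and_split and count_listed_words are hypothetical. It also swaps the repeated list.count() calls for collections.Counter, since each count() call walks the whole list again, which means roughly 300 passes for 300 words. Be aware that deleting punctuation outright turns "don't" into "dont".

```python
import string
from collections import Counter

def strip_and_split(text):
    # Delete every ASCII punctuation character, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

def count_listed_words(text, list_word):
    # Counter tallies all tokens in one pass; words never seen read back as 0.
    tallies = Counter(strip_and_split(text))
    return {word: tallies[word] for word in list_word}

# Hypothetical sample input, only for demonstration.
sample_text = "the cat, the hat (and there)."
print(count_listed_words(sample_text, ["the", "cat", "dog"]))
```

Because the tokens are compared whole, "there" does not add to the count for "the".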
jacobangel answered Oct 11 '22 15:10
