
How to efficiently filter a string against a long list of words in Python/Django?

Stack Overflow implemented its "Related Questions" feature by taking the title of the question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a full-text search to find related questions.

I want to do something similar in my Django site. What is the best way to filter a string (the question title in this case) against a long list of words in Python? Any libraries that would enable me to do that efficiently?

Continuation asked Sep 04 '10 06:09



2 Answers

You could do this very simply with Python's built-in set and string methods, then measure how it performs before reaching for anything heavier (premature optimisation being the root of all evil!):

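# Words to drop; a frozenset gives average O(1) membership tests.
common_words = frozenset(("if", "but", "and", "the", "when", "use", "to", "for"))
title = "When to use Python for web applications"
# Lower-case and split on whitespace so the comparison is case-insensitive.
title_words = set(title.lower().split())
# The set difference keeps only the words that are not in the common list.
keywords = title_words.difference(common_words)
print(keywords)  # a set of python, web, applications (order is arbitrary)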
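
If the order of the remaining words matters, or titles contain punctuation (a trailing "?" would otherwise stay attached to the last word), a small variation along these lines might help. It is only a sketch: the function name `extract_keywords` and the regular expression defining a "word" are assumptions for illustration, not part of the approach above.

```python
import re

COMMON_WORDS = frozenset(("if", "but", "and", "the", "when", "use", "to", "for"))


def extract_keywords(title, common_words=COMMON_WORDS):
    """Return the uncommon words of `title`, in order, without duplicates."""
    # Assumed definition of a word: runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", title.lower())
    seen = set()
    keywords = []
    for word in words:
        # Skip common words and anything already collected.
        if word not in common_words and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


print(extract_keywords("When to use Python for web applications?"))
# ['python', 'web', 'applications']
```

With a 10,000-word list the frozenset still fits comfortably in memory, so each lookup stays a constant-time hash check on average.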
Gareth Williams answered Oct 22 '22 10:10



I think a much simpler solution that is still reasonably fast is to use SQLite and regular expressions.

Put the long list of words in an SQLite table and build a B-tree index on the word column. This gives you O(log n) existence lookups. Split the shorter string (the title) into words with a regular expression, then loop over the words and run an EXISTS query for each one.

You can also stem the words first with the Porter stemmer from NLTK, so that inflected forms of a word match the same list entry.
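
A minimal sketch of that idea using Python's standard `sqlite3` module might look like the following. The table name `common_words`, the helper names, and the word-splitting pattern are assumptions made for illustration. Declaring the column as `PRIMARY KEY` makes SQLite maintain a B-tree index on it, so no separate `CREATE INDEX` is needed here.

```python
import re
import sqlite3


def build_word_table(conn, words):
    """Store the common words; the PRIMARY KEY is backed by a B-tree index."""
    conn.execute("CREATE TABLE IF NOT EXISTS common_words (word TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO common_words (word) VALUES (?)",
        ((w,) for w in words),
    )
    conn.commit()


def filter_title(conn, title, normalize=str.lower):
    """Return the words of `title` that are not in the common_words table."""
    keywords = []
    # Assumed definition of a word: runs of letters, digits and apostrophes.
    for token in re.findall(r"[A-Za-z0-9']+", title):
        word = normalize(token)
        # One indexed EXISTS lookup per word in the title.
        (is_common,) = conn.execute(
            "SELECT EXISTS(SELECT 1 FROM common_words WHERE word = ?)", (word,)
        ).fetchone()
        if not is_common:
            keywords.append(word)
    return keywords


conn = sqlite3.connect(":memory:")
build_word_table(conn, ["if", "but", "and", "the", "when", "use", "to", "for"])
print(filter_title(conn, "When to use Python for web applications"))
# ['python', 'web', 'applications']
```

The `normalize` parameter is a hypothetical hook where stemming could be plugged in, for example a function that lower-cases the token and passes it through NLTK's `PorterStemmer().stem`. In that case the entries in the table would need to be stemmed the same way before insertion, or the lookups will not match.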

Kevin answered Oct 22 '22 12:10
