Stack Overflow implemented its "Related Questions" feature by taking the title of the question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a full-text search to find related questions.
I want to do something similar in my Django site. What is the best way to filter a string (the question title in this case) against a long list of words in Python? Any libraries that would enable me to do that efficiently?
You could do this very simply using the set and string functionality in Python and see how it performs (premature optimisation being the root of all evil!):
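# Small illustrative stop list; in practice, load the full list of common words.
common_words = frozenset(("if", "but", "and", "the", "when", "use", "to", "for"))
title = "When to use Python for web applications"
# Lowercase and split on whitespace; a set drops duplicates and word order.
title_words = set(title.lower().split())
# Keep only the words that are not in the stop list.
keywords = title_words.difference(common_words)
print(keywords)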
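If the order of the words matters for the search, or the title contains punctuation, a variation along these lines might help. This is only a sketch: the function name extract_keywords and the tokenising regular expression are my own assumptions, not anything Stack Overflow is known to use.

```python
import re

def extract_keywords(title, common_words):
    """Return the words of `title` that are not in `common_words`, in original order."""
    # Tokenise on runs of letters, digits and apostrophes so that trailing
    # punctuation such as "?" does not stick to a word (an assumed rule).
    words = re.findall(r"[a-z0-9']+", title.lower())
    seen = set()
    keywords = []
    for word in words:
        # Set membership is an average O(1) lookup, so a 10,000-word list is cheap.
        if word not in common_words and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords

common_words = frozenset(("if", "but", "and", "the", "when", "use", "to", "for"))
print(extract_keywords("When to use Python for web applications?", common_words))
# ['python', 'web', 'applications']
```

Building the frozenset once (for example at module import time) and reusing it for every request keeps the per-title cost proportional to the number of words in the title.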
I think a much simpler solution that is still reasonably fast is to use SQLite and regular expressions.
Put the long list of words in an SQLite table and build a B-tree index on it. This gives you O(log n) EXISTS queries. Split the shorter string (the title) with a regular expression and loop over the words, running an EXISTS query for each of them.
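A rough sketch of that approach with the standard-library sqlite3 module could look like the following. The table name common_words, the helper names and the in-memory database are assumptions for illustration; a real site would presumably use a file-backed database loaded with the full word list.

```python
import re
import sqlite3

# In-memory database for the example only.
conn = sqlite3.connect(":memory:")
# A TEXT PRIMARY KEY makes SQLite maintain a unique B-tree index on `word`.
conn.execute("CREATE TABLE common_words (word TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO common_words (word) VALUES (?)",
    [(w,) for w in ("if", "but", "and", "the", "when", "use", "to", "for")],
)
conn.commit()

def is_common(conn, word):
    """Return True if `word` is in the common_words table (indexed lookup)."""
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM common_words WHERE word = ?)", (word,)
    ).fetchone()
    return bool(row[0])

def filter_title(conn, title):
    """Split the title into words and keep those not found in the table."""
    words = re.findall(r"\w+", title.lower())
    return [w for w in words if not is_common(conn, w)]

print(filter_title(conn, "When to use Python for web applications"))
# ['python', 'web', 'applications']
```

This issues one query per word in the title, which is fine for short titles.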
You can stem the words first with the Porter stemmer from NLTK.