I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
Consider including the PorterStemFilter
in your analysis pipeline. Be sure to perform the same analysis on queries that is used when building the index.
I've also used the Lancaster stemming algorithm with good results. Using the PorterStemFilter
as a guide, it is easy to integrate with Lucene.
Word stemming works OK for English, however for languages where word stemming is nearly impossible (like mine) option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene that seems to work very well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With