I'm using a snowball analyzer to stem the titles of multiple documents. Everything works well, but there are some quirks.
Example:
A search for "valv", "valve", or "valves" returns the same number of results. This makes sense since the snowball analyzer reduces everything down to "valv".
I run into problems when using a wildcard. A search for "valve*" or "valves*" does not return any results. Searching for "valv*" works as expected.
I understand why this is happening, but I don't know how to fix it.
I thought about writing an analyzer that stores the stemmed and non-stemmed tokens. Basically applying two analyzers and combining the two token streams. But I'm not sure if this is a practical solution.
I also thought about using the AnalyzingQueryParser, but I don't know how to apply this to a multifield query. Also, using the AnalyzingQueryParser would return results for "valve" when searching for "valves*", and that's not the expected behavior.
Is there a "preferred" way of utilizing both wildcards and stemming algorithms?
I have used two different approaches to solve this before:
1. Use two fields: one that contains stemmed terms, and another that contains terms generated by, say, the StandardAnalyzer. When you parse the search query, search the "standard" field if it is a wildcard search; otherwise use the field with stemmed terms. This may be harder to use if you have users input their queries directly into Lucene's QueryParser.
2. Write a custom analyzer and index overlapping tokens. It basically consists of indexing the original term and the stem at the same position in the index using the PositionIncrementAttribute. You can look into SynonymFilter for an example of how to use the PositionIncrementAttribute correctly.
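The two-field approach boils down to a routing decision per query term. Here is a minimal plain-Java sketch of that decision only; it does not call Lucene, and the field names `title_exact` and `title_stemmed` as well as the `WildcardFieldRouter` class are made up for illustration. It assumes a simple whitespace-separated query with no phrases or boolean operators.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the routing idea; no Lucene classes are used.
// Field names are hypothetical: "title_exact" would be indexed without
// stemming, "title_stemmed" with the snowball analyzer.
final class WildcardFieldRouter {
    static final String EXACT_FIELD = "title_exact";
    static final String STEMMED_FIELD = "title_stemmed";

    // A term counts as a wildcard term if it contains '*' or '?'.
    static boolean isWildcard(String term) {
        return term.indexOf('*') >= 0 || term.indexOf('?') >= 0;
    }

    // Wildcard terms go to the unstemmed field, everything else to the stemmed one.
    static String fieldFor(String term) {
        return isWildcard(term) ? EXACT_FIELD : STEMMED_FIELD;
    }

    // Rewrites a whitespace-separated query into field-qualified clauses.
    static String route(String userQuery) {
        List<String> clauses = new ArrayList<>();
        for (String term : userQuery.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                clauses.add(fieldFor(term) + ":" + term);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(route("valves* pump"));
    }
}
```

The rewritten string would then be handed to whatever query parser you use, with a per-field analyzer so that only the stemmed field gets stemmed at query time.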
I prefer solution #2.
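To make the overlapping-token idea concrete, here is a rough plain-Java simulation of what such a token stream would look like. It is not a Lucene TokenFilter: the `OverlappingTokens` and `Token` classes are hypothetical, and `toyStem` is a deliberately crude stand-in for the snowball stemmer (it only strips a trailing "s" and then a trailing "e"). In a real filter, the increment of 0 is what you would set through `PositionIncrementAttribute.setPositionIncrement(0)` for the stem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simulation of "original term + stem at the same position".
// Nothing here is Lucene API; it only models the token/increment pairs.
final class OverlappingTokens {

    // A token with its position increment: 1 = next position, 0 = same position.
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    // Toy stemmer, NOT snowball: strips one trailing "s", then one trailing "e".
    static String toyStem(String word) {
        String s = word;
        if (s.length() > 3 && s.endsWith("s")) s = s.substring(0, s.length() - 1);
        if (s.length() > 3 && s.endsWith("e")) s = s.substring(0, s.length() - 1);
        return s;
    }

    // Emits each lowercased word, followed by its stem at the same position
    // when the stem differs from the word.
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty()) continue;
            out.add(new Token(word, 1));
            String stem = toyStem(word);
            if (!stem.equals(word)) out.add(new Token(stem, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Gate valves"));
    }
}
```

With both forms in the index, a prefix query such as "valves*" can match the original term "valves" without matching a document that only contains "valve", while a stemmed, non-wildcard query still matches through "valv".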