I would like to build an index containing n-grams of each line from my input file, which looks like this:
Segeln bei den Olympischen Sommerspielen
Erdmond
Olympische Spiele
Turnen bei den Olympischen Sommerspielen
Tennis bei den Olympischen Sommerspielen
Geschichte der Astronomie
I need the n-grams because I would like to search the index while assuming that the search term contains many typing errors. For example, I would like to find "Geschichte der Astronomie" if I search with the term "schichte astrologie". It would be even better if it could give me a list of the best possible matches, let's say the best 10 matches, no matter how bad they may be. I hope you can point me in the right direction if there is a better way to achieve this than with n-grams, or give me a hint on how to create the index and how to query it. I would be very happy to have an example that helps me understand how to do it. I currently use Lucene 4.3.1. I would prefer to implement it in Java rather than build the index on the command line.
There are a lot of different ways to approach this problem, and Lucene has a lot of tools to help with them. To my mind, n-grams are probably not the best approach in this situation.
GermanStemmer might be useful here (I'm not sure whether it actually handles the "ge" prefix, but that would be a good example of something that a stemmer might deal with), among other possibilities.
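To see what the stemmer actually does to your titles, here is a minimal sketch (assuming Lucene 4.3.1) that wires GermanStemFilter, which applies GermanStemmer, into an analyzer chain and prints the tokens it produces for one of your lines; the field name "title" is just a placeholder:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.de.GermanStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class GermanStemDemo {

    // Analyzer chain: StandardTokenizer -> LowerCaseFilter -> GermanStemFilter (applies GermanStemmer).
    static class GermanStemAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
            TokenStream result = new LowerCaseFilter(Version.LUCENE_43, source);
            result = new GermanStemFilter(result);
            return new TokenStreamComponents(source, result);
        }
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new GermanStemAnalyzer();
        // Print the stemmed tokens for one of the input lines to judge whether the stemming suits your data.
        TokenStream ts = analyzer.tokenStream("title", new StringReader("Geschichte der Astronomie"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}
```

The same analyzer can then be passed to an IndexWriter and a query parser so that indexing and searching stem the terms consistently.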
As far as implementing it with n-grams goes, NGramTokenizer would be the correct tokenizer for that.
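If you do go the n-gram route, here is a rough sketch of how it could look (assuming Lucene 4.3.1; the trigram size, the field name "title", and the in-memory RAMDirectory are arbitrary choices for illustration): each line becomes a document, and the query is an OR over the n-grams of the misspelled search term, so the 10 best-overlapping titles come back ranked by score.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NGramSearchDemo {

    // Analyzer that splits text into lowercased 3-grams; gram size 3 is an arbitrary choice.
    static class TrigramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            NGramTokenizer source = new NGramTokenizer(reader, 3, 3);
            TokenStream result = new LowerCaseFilter(Version.LUCENE_43, source);
            return new TokenStreamComponents(source, result);
        }
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new TrigramAnalyzer();
        Directory dir = new RAMDirectory(); // in-memory for the example; use FSDirectory for a real index

        // Index each input line as its own document.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_43, analyzer));
        String[] lines = {
            "Segeln bei den Olympischen Sommerspielen", "Erdmond", "Olympische Spiele",
            "Turnen bei den Olympischen Sommerspielen", "Tennis bei den Olympischen Sommerspielen",
            "Geschichte der Astronomie"
        };
        for (String line : lines) {
            Document doc = new Document();
            doc.add(new TextField("title", line, Field.Store.YES)); // indexed as n-grams, stored for display
            writer.addDocument(doc);
        }
        writer.close();

        // Build an OR query over the n-grams of the (possibly misspelled) search term,
        // so documents sharing the most n-grams score highest.
        BooleanQuery query = new BooleanQuery();
        TokenStream ts = analyzer.tokenStream("title", new StringReader("schichte astrologie"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            query.add(new TermQuery(new Term("title", term.toString())), BooleanClause.Occur.SHOULD);
        }
        ts.end();
        ts.close();

        // Retrieve the 10 best matches, however weak they are.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title") + "  score=" + hit.score);
        }
        reader.close();
    }
}
```

Because every n-gram is an optional SHOULD clause, any title sharing at least one n-gram with the query is a candidate, and the ranking simply reflects the degree of overlap, which gives you a "best 10, no matter how bad" list.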