 

Create and query an n-gram index with Lucene

I would like to build an index containing n-grams of each line from my input file, which looks like this:

Segeln bei den Olympischen Sommerspielen
Erdmond
Olympische Spiele
Turnen bei den Olympischen Sommerspielen
Tennis bei den Olympischen Sommerspielen
Geschichte der Astronomie

I need the n-grams because I want to search the index, but I have to assume that the search term contains many typing errors. For example, I would like to find "Geschichte der Astronomie" when I search for "schichte astrologie". It would be even better if the search returned a list of the best possible matches, let's say the top 10, no matter how poor they are.

If there is a better way to achieve this than n-grams, I hope you can point me in the right direction. Otherwise, a hint on how to create the index and how to query it would help, ideally with an example. I currently use Lucene 4.3.1, and I would prefer to implement this in Java rather than build the index on the command line.

tadumtada asked Oct 21 '22 00:10



1 Answer

There are many ways to approach this problem, and Lucene has a lot of tools to help with them. To my mind, n-grams are probably not the best approach in this situation. Some alternatives:

  • Stemmers reduce terms to their root, based on linguistic rules (e.g. matching "fishing", "fished" and "fish"). I don't claim to know how GermanStemmer handles the "ge" prefix, but that would be a good example of something a stemmer might deal with.
  • SynonymFilter can handle specific known synonyms you want to recognize (e.g. "astrology" = "astronomy").
  • Fuzzy queries can be used to obtain matches with low edit distances.

Among other possibilities.
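To make "low edit distance" concrete, here is a minimal plain-Java sketch of the Levenshtein distance, the measure a fuzzy query is built on. It does not use any Lucene API; the class name EditDistanceSketch is hypothetical and only serves the illustration. As far as I know, FuzzyQuery in Lucene 4.x allows at most two edits per term, so the per-word distances matter more than the distance between whole lines.

```java
class EditDistanceSketch {

    /**
     * Levenshtein distance: the minimum number of single-character
     * insertions, deletions, or substitutions needed to turn a into b.
     */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a -> j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // empty prefix of b -> i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The two words from the question, compared to the indexed words.
        System.out.println(levenshtein("astrologie", "astronomie")); // prints 2
        System.out.println(levenshtein("schichte", "geschichte"));   // prints 2
    }
}
```

Both words from the question's example are two edits away from the indexed words, which is right at the limit of what a per-term fuzzy match would tolerate under that assumption.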

As for implementing this with n-grams, NGramTokenizer would be the correct tokenizer for that.
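To illustrate the idea behind an n-gram index, independent of Lucene, here is a small plain-Java sketch: each line is split into character trigrams per word, an inverted index maps every trigram to the lines containing it, and a query is ranked by how many trigrams it shares with each line. The class NGramIndexSketch and its methods are hypothetical names for this illustration, not Lucene API, and the overlap count is a much cruder score than what Lucene would compute.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NGramIndexSketch {
    private final int n;                                        // gram size, e.g. 3
    private final List<String> lines = new ArrayList<>();       // indexed lines by id
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // gram -> line ids

    NGramIndexSketch(int n) {
        this.n = n;
    }

    /** Distinct lower-cased character n-grams of each whitespace-separated word. */
    static Set<String> grams(String text, int n) {
        Set<String> out = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    /** Adds one line to the index. */
    void add(String line) {
        int id = lines.size();
        lines.add(line);
        for (String g : grams(line, n)) {
            postings.computeIfAbsent(g, key -> new HashSet<>()).add(id);
        }
    }

    /** Returns up to k lines, ordered by number of n-grams shared with the query. */
    List<String> search(String query, int k) {
        Map<Integer, Integer> hits = new HashMap<>(); // line id -> shared gram count
        for (String g : grams(query, n)) {
            for (int id : postings.getOrDefault(g, Collections.emptySet())) {
                hits.merge(id, 1, Integer::sum);
            }
        }
        List<Integer> ids = new ArrayList<>(hits.keySet());
        // Highest overlap first; ties fall back to insertion order.
        ids.sort((x, y) -> {
            int byScore = hits.get(y) - hits.get(x);
            return byScore != 0 ? byScore : x - y;
        });
        List<String> result = new ArrayList<>();
        for (int id : ids.subList(0, Math.min(k, ids.size()))) {
            result.add(lines.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        NGramIndexSketch index = new NGramIndexSketch(3);
        index.add("Segeln bei den Olympischen Sommerspielen");
        index.add("Erdmond");
        index.add("Olympische Spiele");
        index.add("Turnen bei den Olympischen Sommerspielen");
        index.add("Tennis bei den Olympischen Sommerspielen");
        index.add("Geschichte der Astronomie");
        for (String match : index.search("schichte astrologie", 10)) {
            System.out.println(match);
        }
    }
}
```

With the sample lines from the question, the query "schichte astrologie" shares nine trigrams with "Geschichte der Astronomie", so that line comes first. Lines with no shared trigram (here "Erdmond") are left out entirely, so returning the best 10 "no matter how bad" would need a fallback for zero-overlap lines.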

femtoRgon answered Oct 27 '22 09:10
