Create and query an n-gram index with Lucene

Question

I would like to build an index containing n-grams of each line from my input file, which looks like this:

Segeln bei den Olympischen Sommerspielen
Erdmond
Olympische Spiele
Turnen bei den Olympischen Sommerspielen
Tennis bei den Olympischen Sommerspielen
Geschichte der Astronomie

I need the n-grams because I would like to search the index, but I have to assume that the search term contains many typing errors. For example, I would like to find "Geschichte der Astronomie" if I search with the term "schichte astrologie". It would be even better if it could give me a list of the best possible matches, let's say the best 10 matches, no matter how bad they may be. I hope you can point me in the right direction if there is a better way to achieve this than with n-grams, or give me a hint on how to create the index and how to query it. I would be very happy to have an example that helps me understand how to do it. I currently use Lucene 4.3.1. I would prefer to implement it in Java and not build the index on the command line.

femtoRgon · Accepted Answer

There are a lot of different ways to approach this problem, and Lucene has a lot of tools to help with them. To my mind, n-grams are probably not the best approach in this situation. Some alternatives:

Stemmers reduce terms to their root, based on linguistic rules (e.g. matching "fishing", "fished", and "fish"). I don't claim to know how GermanStemmer handles the "ge" prefix, but that would be a good example of something a stemmer might deal with.
A synonym filter can handle specific known synonyms you want to recognize (e.g. "astrology" = "astronomy").
Fuzzy queries can be used to obtain matches within a low edit distance (in Lucene 4.x, at most 2 edits per term).

Among other possibilities.
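
To make "edit distance" concrete, here is a minimal plain-Java sketch of Levenshtein distance, the kind of measure fuzzy matching is built around. It is not Lucene code, and the class and method names (EditDistanceSketch, distance) are made up for illustration.

```java
/** Plain-Java illustration of edit distance; not part of Lucene. */
final class EditDistanceSketch {

    /** Levenshtein distance: minimum number of single-character inserts, deletes, or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With the terms from the question, distance("astrologie", "astronomie") is 2 and distance("schichte", "geschichte") is also 2, so each misspelled term is two edits away from the indexed term.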

If you do want to implement this with n-grams, NGramTokenizer would be the correct tokenizer for that.
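
To show what an n-gram approach does conceptually, here is a rough plain-Java sketch that splits each line into character n-grams and ranks the lines by how many n-grams they share with the query. It deliberately does not use Lucene or NGramTokenizer. The names (NGramSketch, ngrams, similarity, topMatches) and the choice of the Dice coefficient as the score are assumptions for illustration only, and Lucene's own scoring works differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java illustration of character n-gram matching; not Lucene code. */
final class NGramSketch {

    /** Returns the set of character n-grams of the lower-cased text (spaces included). */
    static Set<String> ngrams(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Dice coefficient of the two n-gram sets: 0 means nothing shared, 1 means identical sets. */
    static double similarity(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        if (ga.isEmpty() || gb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return 2.0 * shared.size() / (ga.size() + gb.size());
    }

    /** Returns up to k lines ordered best match first, however weak the match is. */
    static List<String> topMatches(List<String> lines, String query, int n, int k) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingDouble(
                (String line) -> similarity(line, query, n)).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Calling topMatches with the six lines from the question, the query "schichte astrologie", n = 3 and k = 10 returns all six lines with "Geschichte der Astronomie" first, since it shares by far the most trigrams with the query. A real index would avoid comparing the query against every line, which is what the tokenizer plus an inverted index gives you.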

Tags: java, indexing, search, lucene

tadumtada
