I am new to Lucene and I would really appreciate an example of how to get bigram and trigram tokens into the index.
I'm using the following code, which I have modified to calculate term frequencies and weights, but I need to do the same for bigrams and trigrams, and I can't see where the tokenization happens. I searched online, but some of the suggested classes no longer exist in Lucene 3.4.0 because they have been deprecated.
Any suggestions, please?
Thanks, Moe
EDIT: --------------------------------
Now I'm using NGramTokenFilter, as mbonaci suggested. This is the part of the code where I tokenize a text to get the uni-, bi-, and trigrams, but the n-grams are built at the character level rather than the word level.
Instead of:
[H][e][l][l][o][HE][EL] etc.
I'm looking for: [Hello][World][Hello World]
int min = 1;
int max = 3;
WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
String text = "hello my world";
TokenStream tokenStream = analyzer.tokenStream("Data", new StringReader(text));
// NGramTokenFilter builds n-grams from the characters of each token,
// which is why the output is [H][e][l][l][o][HE]... instead of whole words.
NGramTokenFilter myfilter = new NGramTokenFilter(tokenStream, min, max);
OffsetAttribute offsetAttribute2 = myfilter.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute2 = myfilter.addAttribute(CharTermAttribute.class);
while (myfilter.incrementToken()) {
    int startOffset = offsetAttribute2.startOffset();
    int endOffset = offsetAttribute2.endOffset();
    String term = charTermAttribute2.toString();
    System.out.println(term);
}
You need to look at shingles, which are word-level n-grams; in Lucene 3.4 the class that produces them is ShingleFilter. That article shows how to do it.
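For reference, here is a minimal, untested sketch of how the loop from the question could be rewritten with ShingleFilter (which, as far as I recall, ships in the lucene-analyzers contrib jar for 3.x). It reuses the analyzer, the "Data" field name, and the sample text from the question; the ShingleDemo class name is just for the example.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
    public static void main(String[] args) throws Exception {
        String text = "hello my world";
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
        TokenStream tokenStream = analyzer.tokenStream("Data", new StringReader(text));

        // ShingleFilter combines adjacent tokens (whole words) into n-grams,
        // here bigrams and trigrams; single words are emitted as well.
        ShingleFilter shingles = new ShingleFilter(tokenStream, 2, 3);
        shingles.setOutputUnigrams(true);

        CharTermAttribute termAttr = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println("[" + termAttr.toString() + "]");
        }
        shingles.end();
        shingles.close();
    }
}

With setOutputUnigrams(true) and shingle sizes 2 to 3, the stream for "hello my world" should come out roughly as [hello] [hello my] [hello my world] [my] [my world] [world], i.e. word-level uni-, bi-, and trigrams.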