How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question.

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?

Eric Wilson asked Apr 14 '10 14:04


1 Answer

Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }

Edit: The new way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = charTermAttribute.toString();
    }
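
If it helps to see why the attributes are fetched once, before the loop, here is a rough, self-contained toy model of the pattern. To be clear, none of these classes are Lucene's: ToyTokenStream, ToyTermAttribute and ToyOffsetAttribute are hypothetical stand-ins I made up, and the whitespace splitting is just an assumption to keep the sketch short. The only point is that the stream keeps overwriting the same attribute objects, so you copy the values out while each token is current.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-ins, NOT Lucene classes: they only mimic the shape of the
// attribute-based contract (fetch attributes once, then advance the stream).
final class TokenStreamSketch {

    /** Hypothetical counterpart of CharTermAttribute: holds the current term text. */
    static final class ToyTermAttribute {
        private final StringBuilder buffer = new StringBuilder();
        void set(CharSequence text) { buffer.setLength(0); buffer.append(text); }
        @Override public String toString() { return buffer.toString(); }
    }

    /** Hypothetical counterpart of OffsetAttribute: holds the current offsets. */
    static final class ToyOffsetAttribute {
        private int start;
        private int end;
        void set(int start, int end) { this.start = start; this.end = end; }
        int startOffset() { return start; }
        int endOffset() { return end; }
    }

    /** Whitespace tokenizer that overwrites the same two attribute objects per token. */
    static final class ToyTokenStream {
        final ToyTermAttribute term = new ToyTermAttribute();
        final ToyOffsetAttribute offset = new ToyOffsetAttribute();
        private final String text;
        private int pos;

        ToyTokenStream(String text) { this.text = text; }

        void reset() { pos = 0; }

        /** Advances to the next token; returns false when the input is exhausted. */
        boolean incrementToken() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) return false;
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            term.set(text.subSequence(start, pos));
            offset.set(start, pos);
            return true;
        }
    }

    /** Copies each token's values out of the shared attributes while it is current. */
    static List<String> collect(String text) {
        ToyTokenStream stream = new ToyTokenStream(text);
        ToyTermAttribute term = stream.term;        // fetched once, before the loop
        ToyOffsetAttribute offset = stream.offset;  // same object for every token
        List<String> tokens = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            tokens.add(term + "[" + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(collect("hello big world"));
    }
}
```

Running main on "hello big world" prints [hello[0,5], big[6,9], world[10,15]]. With a real Lucene TokenStream the loop has the same shape, and as far as I know you should also call end() and close() on the stream once incrementToken() returns false.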
Adam Paynter answered Sep 22 '22 18:09