How to use StandardTokenizer from Lucene 5.x.x

There are a lot of examples that show how to use the StandardTokenizer like this:

TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_36, new StringReader(input));

But in newer Lucene versions this constructor is unavailable. The new constructor looks like this:

StandardTokenizer(AttributeFactory factory)

What is the role of this AttributeFactory, and how can I tokenize a String in newer versions of Lucene?

samy asked May 29 '15 07:05


1 Answer

The AttributeFactory creates the AttributeImpl instances that back each Attribute. Attributes hold the per-token state (term text, offsets, type and so on) of the TokenStream, which is the base class that StandardTokenizer builds on to read and track its input.

Little has changed from 4.x to 5.x with respect to the AttributeFactory - in both versions, you can create a StandardTokenizer with an AttributeFactory if you'd like, or if you don't specify one, then AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY will ultimately end up being used.
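To make the attribute idea concrete, here is a minimal sketch that uses the no-argument constructor (so the default factory is used) and reads two attributes per token. It assumes Lucene 5.x or later on the classpath; the class name AttributeDemo, the method describe and the "term@start-end" output format are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical demo class, not part of Lucene.
public class AttributeDemo {

    // Returns one "term@start-end" string per token found in the input.
    public static List<String> describe(String input) throws IOException {
        List<String> out = new ArrayList<>();
        // No factory passed, so the default attribute factory is used.
        try (StandardTokenizer tokenizer = new StandardTokenizer()) {
            // Each attribute is a view onto the state of the current token.
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

            tokenizer.setReader(new StringReader(input));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                out.add(term.toString() + "@" + offset.startOffset() + "-" + offset.endOffset());
            }
            tokenizer.end();
        } // try-with-resources calls close()
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String line : describe("Tokenize me!")) {
            System.out.println(line);
        }
    }
}
```

The same attribute instances are updated in place on every incrementToken() call, which is why they are fetched once before the loop rather than once per token.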

The big difference is that in 4.x you could also pass in a Reader for the input stream as part of the constructor. This means that in 4.x, you would have to create a new StandardTokenizer for each input stream you wanted to process, which would in turn have to re-initialize attributes from the AttributeFactory.

I'm no Lucene dev, but my guess is that this is just a restructure to encourage reuse of the attributes across the reading of multiple streams. If you take a look at the internals of the TokenStream and the default AttributeFactory implementation, there is a LOT of reflection involved in creating and setting attributes. If I had to guess, the StandardTokenizer constructor that takes a Reader was removed to encourage reuse of the tokenizer and its attributes, because initializing those attributes is relatively expensive.

EDIT

Adding a long-overdue example - sorry for not leading with this:

// Imports (Lucene 5.x package names)
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

// Define your attribute factory (or use the default) - same between 4.x and 5.x
AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

// Create the tokenizer and prepare it for reading (use ONE of the two variants)
//  Lucene 4.x
// StandardTokenizer tokenizer =
//         new StandardTokenizer(factory, new StringReader("Tokenize me!"));
// tokenizer.reset();
//  Lucene 5.x
StandardTokenizer tokenizer = new StandardTokenizer(factory);
tokenizer.setReader(new StringReader("Tokenize me!"));
tokenizer.reset();

// Then process tokens - same between 4.x and 5.x
// NOTE: Here I'm adding a single expected attribute to handle string tokens,
//  but you would probably want to do something more meaningful/elegant
CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
    // Grab the term
    String term = attr.toString();

    // Do something crazy...
}

// Finish the stream and release the reader
tokenizer.end();
tokenizer.close();
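
The reuse that the 5.x API seems designed for can be sketched roughly as follows: one tokenizer is created once and fed a new Reader per input. This assumes Lucene 5.x or later on the classpath; the class name TokenizerReuse and the method tokenize are hypothetical helpers, not Lucene API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
public class TokenizerReuse {

    // Created once, so the attributes are only initialized once.
    private final StandardTokenizer tokenizer = new StandardTokenizer();
    private final CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

    // Tokenizes one input string using the shared tokenizer instance.
    public List<String> tokenize(String input) throws IOException {
        List<String> terms = new ArrayList<>();
        tokenizer.setReader(new StringReader(input));
        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                terms.add(term.toString());
            }
            tokenizer.end();
        } finally {
            // close() must be called before the next setReader() call
            tokenizer.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        TokenizerReuse reuse = new TokenizerReuse();
        System.out.println(reuse.tokenize("Tokenize me!"));
        System.out.println(reuse.tokenize("Then tokenize me too."));
    }
}
```

Note that StandardTokenizer on its own does not lowercase terms or remove stop words, so "Tokenize" comes back with its capital letter. The instance is not thread-safe, so a shared tokenizer like this should stay confined to one thread.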
rusnyder answered Nov 16 '22 05:11