There are a lot of examples that show how to use the StandardTokenizer like this:
TokenStream tokenStream = new StandardTokenizer(
Version.LUCENE_36, new StringReader(input));
But in newer Lucene versions this constructor is unavailable. The new constructor looks like this:
StandardTokenizer(AttributeFactory factory)
What is the role of this AttributeFactory, and how can I tokenize a String in newer versions of Lucene?
The AttributeFactory creates AttributeImpls, which are sources for Attributes. Attributes govern the behavior of the TokenStream, which is the underlying mechanism used for reading/tracking the data stream for the StandardTokenizer.
Little has changed from 4.x to 5.x with respect to the AttributeFactory: in both versions, you can create a StandardTokenizer with an AttributeFactory if you'd like, or if you don't specify one, then AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY will ultimately end up being used.
The big difference is that in 4.x you could also pass in a Reader for the input stream as part of the constructor. This means that in 4.x you would have to create a new StandardTokenizer for each input stream you wanted to process, which would in turn have to re-initialize attributes from the AttributeFactory.
I'm no Lucene dev, but my guess is that this is just a restructure to encourage reuse of the attributes across the reading of multiple streams. If you take a look at the internals of the TokenStream and the default AttributeFactory implementation, there is a LOT of reflection involved in creating and setting attributes. So the StandardTokenizer constructor that takes a Reader was probably removed to encourage reuse of the tokenizer and its attributes, because initializing those attributes is relatively expensive.
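To make the reuse idea concrete, here is a minimal sketch of the pattern using only the Java standard library. ReusableTokenizerSketch is a hypothetical stand-in, not a Lucene class: it only borrows the method names setReader, reset, incrementToken and close, splits on whitespace, and uses a counter (setupCount) to stand in for the relatively expensive attribute setup. It is meant to illustrate why swapping the Reader on an existing instance avoids repeating that setup, not to mirror Lucene's internals.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Lucene-style tokenizer; NOT a Lucene class.
final class ReusableTokenizerSketch {
    // Counts how often the (pretend) expensive setup runs.
    static int setupCount = 0;

    private final StringBuilder term = new StringBuilder(); // per-token state, reused
    private Reader pending; // set by setReader, picked up by reset
    private Reader input;   // the reader currently being consumed

    ReusableTokenizerSketch() {
        setupCount++; // assumption: stands in for creating attributes via a factory
    }

    void setReader(Reader reader) {
        this.pending = reader;
    }

    void reset() {
        if (pending == null) {
            throw new IllegalStateException("call setReader before reset");
        }
        input = pending;
        pending = null;
        term.setLength(0);
    }

    // Advances to the next whitespace-delimited token; returns false at end of input.
    boolean incrementToken() throws IOException {
        if (input == null) {
            throw new IllegalStateException("call reset before incrementToken");
        }
        term.setLength(0);
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (term.length() > 0) {
                    return true; // a token just ended
                }
            } else {
                term.append((char) c);
            }
        }
        return term.length() > 0; // last token, if any
    }

    String term() {
        return term.toString();
    }

    void close() throws IOException {
        if (input != null) {
            input.close();
            input = null;
        }
    }

    // Tokenizes several inputs with ONE tokenizer instance.
    static List<String> tokenizeAll(List<String> inputs) {
        ReusableTokenizerSketch tokenizer = new ReusableTokenizerSketch();
        List<String> terms = new ArrayList<>();
        try {
            for (String text : inputs) {
                tokenizer.setReader(new StringReader(text)); // swap the input only
                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    terms.add(tokenizer.term());
                }
                tokenizer.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [Tokenize, me!, and, me, too]
        System.out.println(tokenizeAll(Arrays.asList("Tokenize me!", "and me too")));
    }
}
```

However many strings are passed to tokenizeAll, the constructor (and so the pretend setup) runs once. With a constructor that takes the Reader, as in 4.x, each new input would mean a new instance and a repeated setup.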
EDIT
Adding a long-overdue example - sorry for not leading with this:
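// Imports (package names as of Lucene 5.x)
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;
// Define your attribute factory (or use the default) - same between 4.x and 5.x
AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
// Create the tokenizer and prepare it for reading (use ONE of the two variants)
// Lucene 4.x:
// StandardTokenizer tokenizer =
//     new StandardTokenizer(factory, new StringReader("Tokenize me!"));
// Lucene 5.x:
StandardTokenizer tokenizer = new StandardTokenizer(factory);
tokenizer.setReader(new StringReader("Tokenize me!"));
// Both versions: reset() must be called before the first incrementToken()
tokenizer.reset();
// Then process tokens - same between 4.x and 5.x
// NOTE: Here I'm adding a single expected attribute to handle string tokens,
// but you would probably want to do something more meaningful/elegant
CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
    // Grab the term
    String term = attr.toString();
    // Do something crazy...
}
// Finish the stream and release the reader so the tokenizer can be reused
tokenizer.end();
tokenizer.close();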