 

Apache Lucene TokenStream contract violation

Tags: java, lucene

Using an Apache Lucene TokenStream to remove stop words causes this error:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

I use this code:

public static String removeStopWords(String string) throws IOException {
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
    TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
    StringBuilder stringBuilder = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

    while(stopFilter.incrementToken()) {
        if(stringBuilder.length() > 0 ) {
            stringBuilder.append(" ");
        }

        stringBuilder.append(token.toString());
    }

    stopFilter.end();
    stopFilter.close();

    return stringBuilder.toString();
}

But as you can see, I never call reset() or close().

So why am I getting this error?

Mulgard asked May 29 '14 10:05


1 Answer

I never call reset() or close().

Well, that is your problem: the reset() call is missing. If you read the TokenStream Javadoc, you will find the following:

The workflow of the new TokenStream API is as follows:

  1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
  2. The consumer calls TokenStream#reset()
  3. ...
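To make the workflow concrete, here is a toy model of the contract. This is not Lucene's code: ToyTokenStream and its term() method are hypothetical names made up for illustration, and the class only mimics the rule that incrementToken() is illegal before reset(). It shows the consuming order reset(), incrementToken() loop, end(), close().

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token stream; NOT part of Lucene.
public class ToyTokenStream {
    private final List<String> tokens;
    private Iterator<String> cursor; // stays null until reset() is called
    private String current;

    public ToyTokenStream(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    // Step 2 of the workflow: the consumer must call this before reading.
    public void reset() {
        cursor = tokens.iterator();
    }

    // Advances to the next token; refuses to run if reset() was skipped.
    public boolean incrementToken() {
        if (cursor == null) {
            throw new IllegalStateException("contract violation: reset() call missing");
        }
        if (!cursor.hasNext()) {
            return false;
        }
        current = cursor.next();
        return true;
    }

    public String term() {
        return current;
    }

    // End-of-stream hook; nothing to do in this toy model.
    public void end() {
    }

    // Releases the stream; it needs another reset() before reuse.
    public void close() {
        cursor = null;
    }

    public static void main(String[] args) {
        ToyTokenStream stream = new ToyTokenStream("quick", "brown", "fox");

        try {
            stream.incrementToken(); // the mistake from the question
        } catch (IllegalStateException e) {
            System.out.println("Without reset(): " + e.getMessage());
        }

        stream.reset(); // the missing step
        StringBuilder joined = new StringBuilder();
        while (stream.incrementToken()) {
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(stream.term());
        }
        stream.end();
        stream.close();
        System.out.println(joined);
    }
}
```

The point of the sketch is only the ordering: reading before reset() fails, and the same stream works once reset() comes first.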

I only had to add one line with reset() to your code and it worked.

...    
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();   // I added this 
while(stopFilter.incrementToken()) {
...
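For reference, the whole method with the workflow applied might look like the sketch below. It assumes Lucene 4.7 with the analyzers-common module on the classpath, as in the question; the class name StopWordRemover is made up. Two things differ from the snippet above: reset() is called on the outermost stream (a TokenFilter's reset() also resets the stream it wraps, so this covers the tokenizer too), and close() sits in a finally block so the stream is released even if reading fails.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical wrapper class; assumes the Lucene 4.7 API used in the question.
public class StopWordRemover {

    public static String removeStopWords(String text) throws IOException {
        // 1. Build the chain: tokenizer -> standard filter -> stop filter.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_47, new StringReader(text));
        stream = new StandardFilter(Version.LUCENE_47, stream);
        stream = new StopFilter(Version.LUCENE_47, stream, StandardAnalyzer.STOP_WORDS_SET);

        // Attributes are shared along the chain, so ask the outermost stream.
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        StringBuilder result = new StringBuilder();

        try {
            stream.reset();                    // 2. mandatory before incrementToken()
            while (stream.incrementToken()) {  // 3. consume the tokens
                if (result.length() > 0) {
                    result.append(' ');
                }
                result.append(term.toString());
            }
            stream.end();                      // 4. end-of-stream operations
        } finally {
            stream.close();                    // 5. release resources
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(removeStopWords("the quick brown fox"));
    }
}
```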
mindas answered Oct 22 '22 22:10