 

How to deal with "Document contains at least one immense term" in SOLR?

Tags:

solr

lucene

In LUCENE-5472, Lucene was changed to throw an error when a term is too long, rather than just logging a message. The error states that Solr does not accept tokens larger than 32766 bytes:

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111, 110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101, 32, 116]...', original message: bytes can be at most 32766 in length; got 43225
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:671)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
    ... 54 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 43225
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
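The key detail in the message is that the 32766 limit applies to the UTF-8 *encoded byte length* of a single term, not to its character count, which is why the trace reports "got 43225" bytes. A quick sketch in plain Python (not Solr code) shows how the two counts can differ:

```python
# Lucene's per-term limit is 32766 bytes of the term's UTF-8
# encoding, not 32766 characters.
MAX_TERM_BYTES = 32766

term = "caffè" * 10000        # 50000 characters; "è" is 2 bytes in UTF-8
encoded = term.encode("utf-8")

print(len(term))              # 50000 characters
print(len(encoded))           # 60000 bytes -- this is what Lucene checks
print(len(encoded) > MAX_TERM_BYTES)  # True: this term would be rejected
```

So a term can stay well under 32766 characters and still blow past the byte limit once non-ASCII characters are involved.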

Trying to fix this issue, I added two filters to the schema (the last two filters in the analyzer chain below):

<field name="text" type="text_en_splitting" termPositions="true" termOffsets="true" termVectors="true" indexed="true" required="false" stored="true"/>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <!-- the two newly added filters -->
        <filter class="solr.TruncateTokenFilterFactory" prefixLength="32700"/>
        <filter class="solr.LengthFilterFactory" min="2" max="32700"/>
      </analyzer>
</fieldType>
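One thing worth noting about the chain above: both filters operate on *character* counts (Java chars), while Lucene's limit is in UTF-8 bytes, so a 32700-character token of multi-byte text can still exceed 32766 bytes. A sketch of a safer placement and value (the `prefixLength` of 8191 is illustrative, chosen because UTF-8 encodes a character in at most 4 bytes, and 32766 / 4 ≈ 8191):

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- truncate at the end of the chain so it sees the final tokens;
       prefixLength counts characters, not UTF-8 bytes, so 8191 chars
       guarantees at most 32764 bytes even for 4-byte characters -->
  <filter class="solr.TruncateTokenFilterFactory" prefixLength="8191"/>
</analyzer>
```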

After adding the filters the error was still the same (which made me think the filters were not set up correctly, maybe?). Update: restarting the server was the key, thanks to Mr. Bashetti.

The question is: which one is better, LengthFilterFactory or TruncateTokenFilterFactory? And is it right to assume that a byte equals a character (since the filter should remove 'unusual' characters)? Thank you!
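On the difference between the two filters: they handle an over-long token in opposite ways. LengthFilter *drops* any token outside its min/max range entirely, while TruncateTokenFilter *keeps* every token but cuts it down to the prefix length. A plain-Python sketch of the two behaviors (function names and values are illustrative, not Solr API):

```python
def length_filter(tokens, min_len=2, max_len=32700):
    # LengthFilterFactory: discard tokens whose character count
    # falls outside [min_len, max_len]
    return [t for t in tokens if min_len <= len(t) <= max_len]

def truncate_filter(tokens, prefix_length=32700):
    # TruncateTokenFilterFactory: keep every token, but only its
    # first prefix_length characters
    return [t[:prefix_length] for t in tokens]

tokens = ["a", "ok", "x" * 40000]
print(length_filter(tokens))    # drops "a" (too short) and the 40000-char token
print(truncate_filter(tokens))  # keeps all three, shortening the long one
```

So the choice depends on whether a giant token still carries searchable value (truncate) or is junk you would rather not index at all (drop). Note also that both count characters, not bytes.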

asked May 06 '16 by salvob

1 Answer

The error says that Solr does not accept tokens larger than 32766 bytes.

The issue arose because your text field originally used the string fieldType, and you are still getting the same error after changing the field type because you have not restarted the Solr server since making the change.

I don't think there is any need to add TruncateTokenFilterFactory or LengthFilterFactory.

But that is left to you and your requirements.

answered Nov 03 '22 by Abhijit Bashetti