In LUCENE-5472, Lucene was changed to throw an error if a term is too long, rather than just logging a message. This error states that Solr doesn't accept tokens larger than 32766 bytes (the offending term here is 43225 bytes after UTF-8 encoding):
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111, 110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101, 32, 116]...', original message: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:671)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
... 54 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
To try to fix this issue, I added two filters to the schema (in bold):
<field name="text" type="text_en_splitting" termPositions="true" termOffsets="true" termVectors="true" indexed="true" required="false" stored="true"/>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
**<filter class="solr.TruncateTokenFilterFactory" prefixLength="32700"/>
<filter class="solr.LengthFilterFactory" min="2" max="32700" />**
</analyzer>
</fieldType>
The error is still the same (this makes me think that the filters are not correctly set up, maybe?). Update: restarting the server was the key, thanks to Mr. Bashetti.
The question is: which one is better, LengthFilterFactory or TruncateTokenFilterFactory? And is it right to assume that a byte is a character (since the filter should remove 'unusual' characters)?
Thank you!
The error says that "SOLR doesn't accept token larger than 32766"
The issue is that you used the String fieldType earlier for your field "text", and you are still getting the same error after changing the field type because you have not restarted the Solr server since making the changes.
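Restarting picks up the schema change; reloading the core also works. For example, assuming the default port 8983 and a core named collection1 (adjust both to your setup), the Core Admin API can reload it:
http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1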
I don't think there is any need to add TruncateTokenFilterFactory or LengthFilterFactory, but that is up to you and your requirements.
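For reference, a rough sketch of what each filter would do if you kept it (the values below are only illustrative, not recommendations): TruncateTokenFilterFactory shortens an over-long token to its first prefixLength characters, while LengthFilterFactory silently drops any token whose length falls outside [min, max]. Also note that both filters count characters, whereas the 32766 limit applies to the UTF-8 byte length of a term, so a byte is not always one character for non-ASCII text.
<!-- keep only the first 255 characters of each token (illustrative value) -->
<filter class="solr.TruncateTokenFilterFactory" prefixLength="255"/>
<!-- drop tokens shorter than 2 or longer than 255 characters (illustrative values) -->
<filter class="solr.LengthFilterFactory" min="2" max="255"/>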