
Adding a document to the index in SOLR: Document contains at least one immense term

Tags:

solr

I am adding a document to a Solr index from a Java program, but the add(inputDoc) call throws an exception. The log in the Solr web interface contains the following:

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[99, 111, 112, 101, 114, 116, 105, 110, 97, 32, 105, 110, 102, 111, 114, 109, 97, 122, 105, 111, 110, 105, 32, 113, 117, 101, 115, 116, 111, 32]...', original message: bytes can be at most 32766 in length; got 226781
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:457)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
    ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 226781
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:663)
    ... 47 more

What should I do to solve this problem?

asked Apr 04 '15 by rolando

People also ask

What is indexing in Solr?

By adding content to an index, we make it searchable by Solr. A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.

What are Solr documents?

In Solr, a Document is the unit of search and index. An index consists of one or more Documents, and a Document consists of one or more Fields. In database terminology, a Document corresponds to a table row, and a Field corresponds to a table column.

What is stored in Solr?

In general, your documents may contain fields that aren't useful from a search perspective but are still useful for displaying search results. In Solr, these are called stored fields. (Solr in Action)


2 Answers

I had the same problem and eventually solved it. Check the type of your "text" field; I suspect it is "strings".

You can find it in the managed-schema of the core:

<field name="text" type="strings"/>

Or you can go to the Solr Admin UI, access http://localhost:8983/solr/CORE_NAME/schema/fieldtypes?wt=json and search for "text". If you see something like the following, you know your "text" field is defined with the strings type:

  {
    "name":"strings",
    "class":"solr.StrField",
    "multiValued":true,
    "sortMissingLast":true,
    "fields":["text"],
    "dynamicFields":["*_ss"]
  }

If so, my solution will work for you: change the type from "strings" to "text_general" in managed-schema. (Make sure the type of "text" in schema.xml is also "text_general".)

   <field name="text" type="text_general"/>

This will solve your problem: strings is a non-tokenized string type, so the entire field value becomes a single term, while text_general is a tokenized text type whose individual tokens stay under the 32766-byte limit.
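For reference, the stock text_general type in a default managed-schema is a tokenized type along these lines (the exact filter chain varies by Solr version, so treat this as a sketch rather than your core's literal definition):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer splits the value into individual word tokens,
         so no single term carries the whole field value -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The tokenizer is what makes the difference: solr.StrField indexes the value verbatim as one term, which is how a 226781-byte term ends up hitting the limit.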

answered Sep 28 '22 by Bernice


You have probably hit what is described in LUCENE-5472 [1]: Lucene throws an error when a term is too long. You could:

  • use (in the index analyzer) a LengthFilterFactory [2] in order to filter out those tokens that don't fall within a requested length range

  • use (in the index analyzer) a TruncateTokenFilterFactory [3] to cap the length of indexed tokens

  • use a custom UpdateRequestProcessor [4], but this actually depends on your context
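As a sketch of the first two options, the filter goes in the index-time analyzer chain of the field type (the field-type name and limits here are illustrative; note that the 32766 limit is in bytes, while these filters count characters, so a character limit of 8191 is safe even if every character encodes to 4 UTF-8 bytes):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Option 1: silently drop tokens outside the [1, 8191] character range -->
    <filter class="solr.LengthFilterFactory" min="1" max="8191"/>
    <!-- Option 2 (alternative to option 1): keep only the first
         255 characters of each token -->
    <!-- <filter class="solr.TruncateTokenFilterFactory" prefixLength="255"/> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

LengthFilterFactory discards oversized tokens entirely, while TruncateTokenFilterFactory keeps a prefix of them, so pick based on whether partial matches on huge tokens are worth anything to your users.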

[1] https://issues.apache.org/jira/browse/LUCENE-5472
[2] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
[3] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.TruncateTokenFilterFactory
[4] https://wiki.apache.org/solr/UpdateRequestProcessor
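If you would rather guard on the client side before calling add(inputDoc), you can truncate the field value yourself. A minimal sketch of byte-safe UTF-8 truncation follows (the class and method names are hypothetical, not part of SolrJ; unlike an analyzer-side filter, this also changes the stored value):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {

    /**
     * Truncates s so that its UTF-8 encoding is at most maxBytes bytes,
     * without splitting a multi-byte character. Lucene's per-term limit
     * is 32766 bytes, so maxBytes = 32766 keeps a single-term field legal.
     */
    public static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s;
        }
        int end = maxBytes;
        // Back up while the cut point lands on a UTF-8 continuation
        // byte (bit pattern 10xxxxxx), so no character is split in half.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```

You would apply this to the offending field value before putting it into the SolrInputDocument, e.g. `inputDoc.setField("text", Utf8Truncate.truncateUtf8(value, 32766))`.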

answered Sep 28 '22 by Andrea