
UTF8 encoding is longer than the max length 32766

I've upgraded my Elasticsearch cluster from 1.1 to 1.2 and now I get errors when indexing a somewhat large string.

{
  "error": "IllegalArgumentException[Document contains at least one immense term in field=\"response_body\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[7b 22 58 48 49 5f 48 6f 74 65 6c 41 76 61 69 6c 52 53 22 3a 7b 22 6d 73 67 56 65 72 73 69]...']",
  "status": 500
}

The mapping of the index:

{
  "template": "partner_requests-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "request": {
      "properties": {
        "asn_id": { "index": "not_analyzed", "type": "string" },
        "search_id": { "index": "not_analyzed", "type": "string" },
        "partner": { "index": "not_analyzed", "type": "string" },
        "start": { "type": "date" },
        "duration": { "type": "float" },
        "request_method": { "index": "not_analyzed", "type": "string" },
        "request_url": { "index": "not_analyzed", "type": "string" },
        "request_body": { "index": "not_analyzed", "type": "string" },
        "response_status": { "type": "integer" },
        "response_body": { "index": "not_analyzed", "type": "string" }
      }
    }
  }
}

I've searched the documentation and didn't find anything about a maximum field size. Based on the core types section, I don't understand why I should "correct the analyzer" for a not_analyzed field.

asked Jun 03 '14 by jlecour

3 Answers

You are running into the limit on the maximum size of a single term. When you set a field to not_analyzed, its whole value is treated as one single term. The maximum size for a single term in the underlying Lucene index is 32766 bytes, which I believe is hard-coded.

Your two primary options are to either change the type to binary or to continue to use string but set the index type to "no".
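
For illustration, a minimal sketch of those two options applied to the "response_body" field via the index template from the question (the template name, the local endpoint and the curl form are my assumptions; the property definition is the only real point):

# Option 1: store the value as binary (a base64-encoded string, not searchable at all)
curl -XPUT 'localhost:9200/_template/partner_requests' -d '{
  "template": "partner_requests-*",
  "mappings": {
    "request": {
      "properties": {
        "response_body": { "type": "binary" }
      }
    }
  }
}'

# Option 2: keep it a string, but do not index it (still returned in _source, just not searchable)
#   "response_body": { "type": "string", "index": "no" }

Note that PUT on an index template replaces any existing template with that name, so in practice you would merge the changed property into the full template from the question rather than send a fragment like this.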

answered by John Petrone

If you really want not_analyzed on the property because you want to do some exact filtering, then you can use "ignore_above": 256.

Here is an example of how I use it in PHP:

    'mapping'    => [
        'type'   => 'multi_field',
        'path'   => 'full',
        'fields' => [
            '{name}' => [
                'type'     => 'string',
                'index'    => 'analyzed',
                'analyzer' => 'standard',
            ],
            'raw' => [
                'type'         => 'string',
                'index'        => 'not_analyzed',
                'ignore_above' => 256,
            ],
        ],
    ],

In your case you probably want to do as John Petrone suggested and set "index": "no", but for anyone else who finds this question after searching on that exception, like I did, the options are:

  • set "index": "no"
  • set "index": "analyzed"
  • set "index": "not_analyzed" and "ignore_above": 256

It depends on whether and how you want to filter on that property (a sketch of the mappings follows).
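
As a rough sketch, assuming the index template from the question and a node on the default local port (both assumptions of mine), the third variant could look like this; swapping the "index" value gives the other two:

# not_analyzed, but values longer than 256 characters are skipped for indexing
# (they are still kept in _source, just not searchable)
curl -XPUT 'localhost:9200/_template/partner_requests' -d '{
  "template": "partner_requests-*",
  "mappings": {
    "request": {
      "properties": {
        "response_body": {
          "type": "string",
          "index": "not_analyzed",
          "ignore_above": 256
        }
      }
    }
  }
}'

# The other two variants only change the property definition:
#   "response_body": { "type": "string", "index": "no" }
#   "response_body": { "type": "string", "index": "analyzed" }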

answered by Mikael


There is a better option than the one John posted, because with his solution you can no longer search on the value.

Back to the problem:

The problem is that the whole field value is indexed as a single term (the complete string). If that term is longer than 32766 bytes, it can't be stored in Lucene.

Older versions of Lucene only logged a warning when a term was too long (and ignored the value). Newer versions throw an exception. See the bug fix: https://issues.apache.org/jira/browse/LUCENE-5472

Solution:

The best option is to define a (custom) analyzer on the field with the long string value. The analyzer can split the long string into smaller strings/terms, which fixes the problem of overly long terms.

Don't forget to also add an analyzer to the "_all" field if you are using that functionality.
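
A minimal sketch of what this could look like, again against the template from the question (the analyzer name "body_analyzer", the choice of the standard tokenizer and the curl/template form are my own assumptions, and the "_all" part is left out here):

# Define a custom analyzer in the index settings and apply it to the field,
# so the long string is tokenized into many small terms instead of one huge one.
curl -XPUT 'localhost:9200/_template/partner_requests' -d '{
  "template": "partner_requests-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "body_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "request": {
      "properties": {
        "response_body": { "type": "string", "analyzer": "body_analyzer" }
      }
    }
  }
}'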

Analyzers can be tested with the REST API: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
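
For example (the index name and the sample text are made up; assuming a local node on the default port):

# Test a named analyzer directly
curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'some long response body text'

# Or test whatever analyzer is mapped to a field of an existing index
curl -XGET 'localhost:9200/partner_requests-2014.06.03/_analyze?field=response_body' -d 'some long response body text'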

answered by Jasper Huzen