How to rename a nested field containing dots with elasticsearch rename processor and ingest pipeline

I have a field in Elasticsearch (5.5.1) which I need to rename because its name contains a '.', which is causing various problems. The field I want to rename is nested inside another field.

I am trying to use a Rename Processor in an Ingest Pipeline to do a Reindex as described here: https://stackoverflow.com/a/43142634/5114

Here is my pipeline simulation request (you can copy this verbatim into the Dev Tools utility in Kibana to test it):

POST _ingest/pipeline/_simulate
{
    "pipeline" : {
        "description": "rename nested fields to remove dot",
        "processors": [
            {
                "rename" : {
                    "field" : "message.message.group1",
                    "target_field" : "message_group1"
                }
            },
            {
                "rename" : {
                    "field" : "message.message.group2",
                    "target_field" : "message.message_group2"
                }
            }
        ]
    },
    "docs":[
        {
            "_type": "status",
            "_id": "1509533940000-m1-bfd7183bf036bd346a0bcf2540c05a70fbc4d69e",
            "_version": 5,
            "_score": null,
            "_source": {
                "message": {
                    "_job-id": "AV8wHJEaa4J0sFOfcZI5",
                    "message.group1": 0,
                    "message.group2": "foo"
                },
                "timestamp": 1509533940000
            }
        }
    ]
}

The problem is that I get an error when trying to use my pipeline:

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "exception",
            "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
            "header": {
              "processor_type": "rename"
            }
          }
        ],
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "field [message.message.group1] doesn't exist"
          }
        },
        "header": {
          "processor_type": "rename"
        }
      }
    }
  ]
}

I think the problem is caused by the field "message.group1" being inside another field ("message"). I'm not sure how to refer to the field I want in the context of the processor. It seems that there could be ambiguity between cases of nested fields, fields containing dots and nested fields containing dots.

I'm looking for the correct way to reference these fields or, if Elasticsearch cannot do what I want, confirmation that this is not possible. If Elasticsearch can do this, the reindex will probably run very fast; otherwise I have to write an external script to pull the documents, transform them, and re-save them to the new index.

Asked by Mnebuerquo on Nov 01 '17 at 16:11


2 Answers

OK, after digging into the Elasticsearch source code, I think I know why this won't work.

First we look at the Elasticsearch Rename Processor: https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RenameProcessor.java#L76-L84

Object value = document.getFieldValue(field, Object.class);
document.removeField(field);
try {
    document.setFieldValue(targetField, value);
} catch (Exception e) {
    // setting the value back to the original field shouldn't as we just fetched it from that field:
    document.setFieldValue(field, value);
    throw e;
}

What this is doing is looking for the field to rename, getting its value, then removing the field and adding a new field with the same value but with the new name.

Now we look at what happens in document.getFieldValue: https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/core/src/main/java/org/elasticsearch/ingest/IngestDocument.java#L101-L108

public <T> T getFieldValue(String path, Class<T> clazz) {
    FieldPath fieldPath = new FieldPath(path);
    Object context = fieldPath.initialContext;
    for (String pathElement : fieldPath.pathElements) {
        context = resolve(pathElement, path, context);
    }
    return cast(path, context, clazz);
}

Notice it uses a FieldPath object to represent the path to the field in the document.

Now look at how the FieldPath represents the path: https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/core/src/main/java/org/elasticsearch/ingest/IngestDocument.java#L688

this.pathElements = newPath.split("\\.");

This is splitting the path on any "." character, because that is the delimiter between path elements in field names.

The problem is that the source document has a field named "message.group1", so we need to be able to reference that. Just splitting the path on "." does not account for field names that themselves contain a ".". We would need a syntax more like JavaScript's, where brackets and quotes could mark a dot as part of a name rather than a path separator.

If the source documents were all transformed so that a "." in a field name turned that field into an object before saving, then this path scheme would work. But when source documents have field names containing ".", we cannot reference those fields in certain contexts.
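To make the failure concrete, here is a minimal Python sketch of the same split-on-dot lookup. It is only an illustration of the idea, not the Elasticsearch implementation: `get_field_value` is a hypothetical stand-in that ignores lists, the `_source` and `_ingest` prefixes, and type casting, all of which the real `IngestDocument` handles.

```python
def get_field_value(source, path):
    """Resolve a path by splitting on every dot, like the quoted FieldPath code.

    Simplified stand-in for illustration only; it handles nested dicts and
    nothing else.
    """
    context = source
    for element in path.split("."):
        if not isinstance(context, dict) or element not in context:
            raise ValueError(f"field [{path}] doesn't exist")
        context = context[element]
    return context


# The document from the question: the inner keys contain literal dots.
dotted = {"message": {"message.group1": 0, "message.group2": "foo"}}

# The same data with the dotted names expanded into a nested object.
expanded = {"message": {"message": {"group1": 0, "group2": "foo"}}}

# Works: every path element matches a real key.
print(get_field_value(expanded, "message.message.group1"))

# Fails: after descending into "message", the lookup asks for a key named
# "message", but the only keys present are "message.group1" and "message.group2".
try:
    get_field_value(dotted, "message.message.group1")
except ValueError as err:
    print(err)
```

No path string can reach the dotted key in this scheme, because any dot written in the path is consumed as a separator before the key lookup happens.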

To solve my problem and reindex my index, I wrote a Python script which pulled a batch of documents, transformed them, and bulk-inserted them into a new index. This is basically what the Elasticsearch Reindex API does, but I did it in Python instead.
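The script itself is not shown here, so the following is only a sketch of what the transform step could look like. The function name `rename_dotted_keys` and the choice of "_" as the replacement character are assumptions made for illustration; fetching the batches and bulk-indexing the results would be done with whatever client you use and is left out.

```python
def rename_dotted_keys(value, replacement="_"):
    """Return a copy of value with every '.' in a dict key replaced.

    Hypothetical helper: recurses through nested dicts and lists, and leaves
    all other values untouched. If two keys collapse to the same name after
    replacement, the later one wins, so check your data for collisions first.
    """
    if isinstance(value, dict):
        return {
            key.replace(".", replacement): rename_dotted_keys(child, replacement)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [rename_dotted_keys(item, replacement) for item in value]
    return value


# The _source of the sample document from the question.
source = {
    "message": {
        "_job-id": "AV8wHJEaa4J0sFOfcZI5",
        "message.group1": 0,
        "message.group2": "foo",
    },
    "timestamp": 1509533940000,
}

cleaned = rename_dotted_keys(source)
print(cleaned["message"])
```

With this approach the renamed fields stay inside the "message" object as "message_group1" and "message_group2"; moving them elsewhere in the document would need an extra step.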

Answered by Mnebuerquo on Sep 19 '22 at 12:09


More than two years later, I came across the same issue. You can have your dotted properties expanded into real nested objects with the dot_expander processor. Its documentation says:

"Expands a field with dots into an object field. This processor allows fields with dots in the name to be accessible by other processors in the pipeline. Otherwise these fields can’t be accessed by any processor."

Issue 37507 on Elasticsearch's Github pointed me in the right direction.
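As a rough sketch of how this could apply to the pipeline from the question, the Python snippet below builds a simulate request body that expands both dotted fields before renaming them. It assumes the processor's `field` and `path` options behave as I understand the documentation (with `path` naming the object that contains the dotted field); I have not run it against a cluster, so treat it as a starting point to test rather than a verified fix.

```python
import json

# Request body for POST _ingest/pipeline/_simulate, built as a plain dict.
simulate_body = {
    "pipeline": {
        "description": "expand dotted fields, then rename them",
        "processors": [
            # Assumption: "path" points at the parent object, "field" is the
            # dotted leaf name inside it.
            {"dot_expander": {"field": "message.group1", "path": "message"}},
            {"dot_expander": {"field": "message.group2", "path": "message"}},
            # Once expanded, the original rename paths should resolve.
            {
                "rename": {
                    "field": "message.message.group1",
                    "target_field": "message_group1",
                }
            },
            {
                "rename": {
                    "field": "message.message.group2",
                    "target_field": "message.message_group2",
                }
            },
        ],
    },
    "docs": [
        {
            "_source": {
                "message": {
                    "_job-id": "AV8wHJEaa4J0sFOfcZI5",
                    "message.group1": 0,
                    "message.group2": "foo",
                },
                "timestamp": 1509533940000,
            }
        }
    ],
}

# Print JSON that can be pasted under the POST line in Kibana Dev Tools.
print(json.dumps(simulate_body, indent=2))
```

If the renames succeed, an empty inner "message" object may be left behind inside "message"; check the simulate output and tidy it up if it matters for your mapping.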

Answered by kheraud on Sep 22 '22 at 12:09