Elasticsearch fielddata - should I use it?

Question

Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.

Index definition

Please note that the use of fielddata

PUT demo_products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "text",
          "analyzer": "my_custom_analyzer",
          "fielddata": true,
        }
      }
    }
  }
}

Data

POST demo_products/product
{
  "brand": "New York Jets"
}

POST demo_products/product
{
  "brand": "new york jets"
}

POST demo_products/product
{
  "brand": "Washington Redskins"
}

Query

GET demo_products/product/_search
{
  "size": 0,
  "aggs": {
    "brand_facet": {
      "terms": {
        "field": "brand"
      }
    }
  }
}

Result

"aggregations": {
    "brand_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "new york jets",
          "doc_count": 2
        },
        {
          "key": "washington redskins",
          "doc_count": 1
        }
      ]
    }
  }

If we use keyword instead of text we end up the 2 buckets for New York Jets because of the differences in casing.

We're concerned about the performance implications by using fielddata. However if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default."

Any other tips to resolve this - or should we not be so concerned about fielddate?

Val · Accepted Answer

Starting with ES 5.2 (out today), you can use normalizers with keyword fields in order to (e.g.) lowercase the value.

The role of normalizers is a bit like analyzers for text fields, though what you can do with them is more restrained, but that would probably help with the issue you're facing.

You'd create the index like this:

PUT demo_products
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

And your query would return this:

  "aggregations" : {
    "brand_facet" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "new york jets",
          "doc_count" : 2
        },
        {
          "key" : "washington redskins",
          "doc_count" : 1
        }
      ]
    }
  }

Best of both worlds!

Elasticsearch fielddata - should I use it?

Tags:

elasticsearch

Rasmus

1 Answers

Val

Recent Activity

Donate For Us

Elasticsearch fielddata - should I use it?

Tags:

elasticsearch

Rasmus

1 Answers

Val

Related questions

Recent Activity

Donate For Us