
Elasticsearch fielddata - should I use it?

Given an index with documents that have a brand property, we need to create a terms aggregation that is case-insensitive.

Index definition

Please note the use of fielddata on the brand field:

PUT demo_products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "text",
          "analyzer": "my_custom_analyzer",
          "fielddata": true
        }
      }
    }
  }
}

Data

POST demo_products/product
{
  "brand": "New York Jets"
}

POST demo_products/product
{
  "brand": "new york jets"
}

POST demo_products/product
{
  "brand": "Washington Redskins"
}

Query

GET demo_products/product/_search
{
  "size": 0,
  "aggs": {
    "brand_facet": {
      "terms": {
        "field": "brand"
      }
    }
  }
}

Result

"aggregations": {
    "brand_facet": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "new york jets",
          "doc_count": 2
        },
        {
          "key": "washington redskins",
          "doc_count": 1
        }
      ]
    }
  }

If we use keyword instead of text, we end up with two buckets for New York Jets because of the differences in casing.
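To make the casing problem concrete, here is a minimal Python sketch that imitates how a terms aggregation would bucket the three sample documents if brand were a plain keyword field. The helper `terms_buckets` is hypothetical and only counts identical strings; it is not how Elasticsearch computes aggregations internally.

```python
from collections import Counter

# Brand values from the three sample documents in the question.
brands = ["New York Jets", "new york jets", "Washington Redskins"]


def terms_buckets(values):
    """Hypothetical stand-in for a terms aggregation: count identical values."""
    return dict(Counter(values))


# A plain keyword field indexes each value verbatim, so casing matters
# and the two spellings of the Jets land in separate buckets.
keyword_buckets = terms_buckets(brands)

for key, doc_count in keyword_buckets.items():
    print(f"{key}: {doc_count}")
```

Each distinct spelling becomes its own bucket with a count of 1, which is the behavior we want to avoid.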

We're concerned about the performance implications of using fielddata. However, if fielddata is disabled, we get the dreaded "Fielddata is disabled on text fields by default" error.

Any other tips to resolve this, or should we not be so concerned about fielddata?

Rasmus asked Jan 26 '17 07:01


1 Answer

Starting with ES 5.2 (out today), you can use normalizers with keyword fields in order to (e.g.) lowercase the value.

Normalizers play a role similar to that of analyzers for text fields. What you can do with them is more restricted, but they should help with the issue you're facing.
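As a rough illustration of the difference, the Python sketch below contrasts a lowercase normalizer, which keeps each value as a single term, with an analyzer that also tokenizes. Both functions are hypothetical emulations written for this example (the whitespace split is only an approximation of a tokenizing analyzer), not Elasticsearch code.

```python
from collections import Counter

# Brand values from the three sample documents in the question.
brands = ["New York Jets", "new york jets", "Washington Redskins"]


def lowercase_normalizer(value):
    """Emulates a normalizer with a lowercase filter: one value in, one term out."""
    return value.lower()


def tokenizing_analyzer(value):
    """Emulates an analyzer that lowercases and splits on whitespace (an approximation)."""
    return value.lower().split()


# Normalized keyword values: whole brands, merged regardless of casing.
normalized_buckets = Counter(lowercase_normalizer(b) for b in brands)

# Analyzed text values: individual words, which is not what a brand facet needs.
analyzed_terms = Counter(t for b in brands for t in tokenizing_analyzer(b))

print(dict(normalized_buckets))
print(dict(analyzed_terms))
```

The normalized counts line up with the buckets a brand facet needs, while the tokenized version breaks each brand into separate words.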

You'd create the index like this:

PUT demo_products
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "brand": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

And your query would return this:

  "aggregations" : {
    "brand_facet" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "new york jets",
          "doc_count" : 2
        },
        {
          "key" : "washington redskins",
          "doc_count" : 1
        }
      ]
    }
  }

Best of both worlds!

Val answered Oct 18 '22 01:10