
Elasticsearch: how to make an aggregation field not change the case of values

I have the following mapping for an aggregation field:

"language" : {
    "type" : "string",
    "index": "analyzed",
    "analyzer" : "standard"
}

The value of a sample document in this property may look like: "en zh_CN"

This property is used only for aggregation. I notice that when I run a terms aggregation on this property:

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        ...
      }
    }
  },
  "aggregations": {
    "facets": {
      "terms": {
        "field": "language"
      }
    }
  }
}

The bucket key values are in lower case.

  "aggregations" : {
    "facets" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "zh_cn",
        "doc_count" : 2
      }, {
        "key" : "en",
        "doc_count" : 1
      } ]
    }
  }

How can I run this aggregation without ES lowercasing the values? I suspect I need to change the mapping for this property, but I am not sure how.

Thanks and regards.

curious1 asked Feb 10 '23 16:02



1 Answer

Try this in your mapping instead:

"language" : {
    "type" : "string",
    "index": "not_analyzed"
}

With "not_analyzed", the whole field value of each document is indexed as a single token, unmodified, and those tokens are what your terms aggregation returns. For the example value you provided, the aggregation will return it verbatim:

"aggregations": {
   "facets": {
      "buckets": [
         {
            "key": "en zh_CN",
            "doc_count": 1
         }
      ]
   }
}
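
To make the effect concrete, here is a small Python sketch that mimics the counting, without calling Elasticsearch. It assumes each document's whole field value becomes exactly one term, which is how a not_analyzed field behaves. The function name and the sample documents are hypothetical and not taken from the gist linked in this answer.

```python
from collections import Counter

def terms_agg_not_analyzed(values):
    """Count each field value as one unmodified term (no splitting, no lowercasing)."""
    return Counter(values)

# Hypothetical "language" values from three documents.
docs = ["en zh_CN", "zh_CN", "en zh_CN"]

buckets = terms_agg_not_analyzed(docs)
for key, doc_count in buckets.most_common():
    print({"key": key, "doc_count": doc_count})
```

Note that "en" never appears as its own bucket here: it only exists as part of the longer value "en zh_CN".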

If you still want the text to be tokenized on whitespace, use the whitespace analyzer in your mapping. It splits on whitespace but does not lowercase the tokens:

"language": {
   "type": "string",
   "analyzer": "whitespace"
}

Then your aggregation will return:

"aggregations": {
   "facets": {
      "buckets": [
         {
            "key": "en",
            "doc_count": 1
         },
         {
            "key": "zh_CN",
            "doc_count": 1
         }
      ]
   }
}
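
The difference between the two analyzers can be sketched in plain Python. This is only a rough imitation: `standard_like_tokens` is a hypothetical stand-in that splits on whitespace and lowercases, which is far simpler than what the real standard analyzer does, and `terms_agg` is a hypothetical helper that counts each document once per distinct term.

```python
from collections import Counter

def whitespace_tokens(text):
    # Split on whitespace only; case is preserved.
    return text.split()

def standard_like_tokens(text):
    # Rough stand-in for the standard analyzer: split, then lowercase.
    return [token.lower() for token in text.split()]

def terms_agg(docs, tokenize):
    counts = Counter()
    for text in docs:
        # A document contributes at most 1 to each term's doc_count.
        counts.update(set(tokenize(text)))
    return counts

# Hypothetical "language" values from two documents.
docs = ["en zh_CN", "zh_CN"]

standard_buckets = terms_agg(docs, standard_like_tokens)
whitespace_buckets = terms_agg(docs, whitespace_tokens)
print(dict(standard_buckets))
print(dict(whitespace_buckets))
```

With these two sample documents, the lowercasing version produces the keys "zh_cn" and "en", as in the question, while the whitespace version keeps "zh_CN" intact.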

Here is the code I used to test both examples:

http://sense.qbox.io/gist/a7b3c7d50c7012537c50d576d03940b28b5f8793

Sloan Ahrens answered Feb 13 '23 21:02
