Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elasticsearch terms aggregation by strings in an array

Tags:

How can I write an Elasticsearch terms aggregation that splits the buckets by the entire term rather than individual tokens? For example, I would like to aggregate by state, but the following returns new, york, jersey and california as individual buckets, not New York and New Jersey and California as the buckets as expected:

curl -XPOST "http://localhost:9200/my_index/_search" -d'
{
    "aggs" : {
        "states" : {
            "terms" : { 
                "field" : "states",
                "size": 10
            }
        }
    }
}'

My use case is like the one described here https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html with just one difference: the city field is an array in my case.

Example object:

{
    "states": ["New York", "New Jersey", "California"]
}

It seems that the proposed solution (mapping the field as not_analyzed) does not work for arrays.

My mapping:

{
    "properties": {
        "states": {
            "type":"object",
            "fields": {
                "raw": {
                    "type":"object",
                    "index":"not_analyzed"
                }
            }
        }
    }
}

I have tried to replace "object" by "string" but this is not working either.

like image 341
Marieke Avatar asked Nov 16 '15 17:11

Marieke


1 Answers

I think all you're missing is "states.raw" in your aggregation (note that, since no analyzer is specified, the "states" field is analyzed with the standard analyzer; the sub-field "raw" is "not_analyzed"). Though your mapping might bear looking at as well. When I tried your mapping against ES 2.0 I got some errors, but this worked:

PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "states": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}

Then I added a couple of docs:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"states":["New York","New Jersey","California"]}
{"index":{"_id":2}}
{"states":["New York","North Carolina","North Dakota"]}

And this query seems to do what you want:

POST /test_index/_search
{
    "size": 0, 
    "aggs" : {
        "states" : {
            "terms" : { 
                "field" : "states.raw",
                "size": 10
            }
        }
    }
}

returning:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "states": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "New York",
               "doc_count": 2
            },
            {
               "key": "California",
               "doc_count": 1
            },
            {
               "key": "New Jersey",
               "doc_count": 1
            },
            {
               "key": "North Carolina",
               "doc_count": 1
            },
            {
               "key": "North Dakota",
               "doc_count": 1
            }
         ]
      }
   }
}

Here's the code I used to test it:

http://sense.qbox.io/gist/31851c3cfee8c1896eb4b53bc1ddd39ae87b173e

like image 73
Sloan Ahrens Avatar answered Oct 04 '22 17:10

Sloan Ahrens