
Get Percentage of Values in Elasticsearch

I have some test documents that look like this:

"hits": {
        ...
            "_source": {
               "student": "DTWjkg",
               "name": "My Name",
               "grade": "A"
            ...
               "student": "ggddee",
               "name": "My Name2",
               "grade": "B"
            ...
               "student": "ggddee",
               "name": "My Name3",
               "grade": "A"

I want to get the percentage of students that have a grade of B. The result would be "33%", assuming there were only 3 students.

How would I do this in Elasticsearch?

So far I have this aggregation, which I feel like is close:

"aggs": {
    "gradeBPercent": {
        "terms": {
            "field" : "grade",
            "script" : "_value == 'B'"
        }
    }
}

This returns:

"aggregations": {
      "gradeBPercent": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "false",
               "doc_count": 2
            },
            {
               "key": "true",
               "doc_count": 1
            }
         ]
      }
   }

I'm not necessarily looking for an exact answer; pointers to terms and keywords I could google would be enough. I've read over the Elasticsearch docs and have not found anything that helps.

SSH This asked Mar 14 '23 03:03



1 Answer

First off, you shouldn't need a script for this aggregation. If you want to limit your results to everyone where `value == 'B'`, then you should do that using a filter, not a script.
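
As a rough illustration of the filter approach, here is a minimal Python sketch that builds a request body using a filter aggregation with a term query on the `grade` field from the question. The function name `grade_filter_query` and the aggregation name `matching_grade` are made up for this example. It also assumes `grade` is indexed as an exact (not analyzed) value; on an analyzed text field the term "B" may not match as written.

```python
import json


def grade_filter_query(grade):
    """Build a search body that counts documents with the given grade."""
    return {
        "size": 0,  # no individual hits needed, only the counts
        "aggs": {
            # "matching_grade" is a hypothetical aggregation name
            "matching_grade": {
                # filter aggregation: a single bucket of matching documents
                "filter": {"term": {"grade": grade}}
            }
        },
    }


body = grade_filter_query("B")

if __name__ == "__main__":
    # Print the JSON that would be sent as the search request body
    print(json.dumps(body, indent=2))
```

The response to a body like this should carry the number of matching documents as the aggregation's `doc_count`, alongside `hits.total` for the whole query, so the percentage is one division away.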

Elasticsearch won't return a percentage directly from this aggregation, but you can easily calculate it from the result of a terms aggregation.

Example:

GET devdev/audittrail/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "uIDRequestID"
      }
    }
  }
}

That returns:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 25083,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "a1": {
      "doc_count_error_upper_bound": 9,
      "sum_other_doc_count": 1300,
      "buckets": [
        {
          "key": 556,
          "doc_count": 34
        },
        {
          "key": 393,
          "doc_count": 28
        },
        {
          "key": 528,
          "doc_count": 15
        }
      ]
    }
  }
}

So what does that response mean?

  • the hits.total field is the total number of records matching your query.
  • the doc_count is telling you how many items are in each bucket.

So for my example here, I could say that the key "556" shows up in 34 of 25083 documents, so its percentage is (34 / 25083) * 100, or roughly 0.14%.
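
The same arithmetic can be done on the client for every bucket. The following is a minimal Python sketch, not part of any Elasticsearch client: the helpers `total_hits` and `bucket_percentages` are hypothetical names, and the `response` dict is a trimmed copy of the response above. It accepts `hits.total` either as a plain number (as in the response above) or as an object with a `value` key, which is how newer Elasticsearch versions report it. In those versions the total may be only a lower bound unless total-hit tracking is enabled, so treat the result with that in mind.

```python
def total_hits(response):
    """Return the total hit count from a search response dict."""
    total = response["hits"]["total"]
    # Newer versions wrap the count in an object: {"value": N, "relation": ...}
    return total["value"] if isinstance(total, dict) else total


def bucket_percentages(response, agg_name):
    """Map each bucket key to its share of all matching documents, in percent."""
    total = total_hits(response)
    if total == 0:
        return {}  # avoid dividing by zero when nothing matched
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


# Trimmed copy of the response shown in the answer
response = {
    "hits": {"total": 25083},
    "aggregations": {
        "a1": {
            "buckets": [
                {"key": 556, "doc_count": 34},
                {"key": 393, "doc_count": 28},
                {"key": 528, "doc_count": 15},
            ]
        }
    },
}

percentages = bucket_percentages(response, "a1")

if __name__ == "__main__":
    for key, pct in percentages.items():
        print(f"{key}: {pct:.2f}%")
```

Applied to the question's own result (one "true" bucket out of three documents), the same function gives about 33.3% for grade B.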

jhilden answered Mar 20 '23 02:03
