In Elasticsearch, I am trying to count the number of distinct field values in the dataset, split by whether the field value appears exactly once or appears twice or more.
In a sense, I am trying to count how often duplicates occur. How can I do this?
Let's say I have the following Elasticsearch documents:
{ "myfield": "bob" }
{ "myfield": "bob" }
{ "myfield": "alice" }
{ "myfield": "eve" }
{ "myfield": "mallory" }
Since "alice", "eve" and "mallory" appear once, and "bob" appears twice, I would expect:
number_of_values_that_appear_once: 3
number_of_values_that_appear_twice_or_more: 1
I can get part of the way with a terms aggregation by looking at the doc_count of each bucket. The output of a terms aggregation on myfield would look something like:
"buckets": [
{
"key": "bob",
"doc_count": 2
},
{
"key": "alice",
"doc_count": 1
},
...
]
From this output, I could just count the buckets where doc_count == 1, for example. But this does not scale: I often have many thousands of distinct values, so the bucket list would be enormous.
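For reference, the client-side counting described above could be sketched roughly like this in Python. The function name `summarize_buckets` and the sample bucket list are made up for illustration. Note that a terms aggregation only returns the top `size` buckets (10 by default), so this only gives correct totals if every distinct value comes back, which is exactly the scaling problem.

```python
def summarize_buckets(buckets):
    """Count how many terms buckets have doc_count == 1 versus doc_count >= 2."""
    once = sum(1 for b in buckets if b["doc_count"] == 1)
    twice_or_more = sum(1 for b in buckets if b["doc_count"] >= 2)
    return {
        "number_of_values_that_appear_once": once,
        "number_of_values_that_appear_twice_or_more": twice_or_more,
    }


# Hypothetical bucket list matching the five sample documents.
sample_buckets = [
    {"key": "bob", "doc_count": 2},
    {"key": "alice", "doc_count": 1},
    {"key": "eve", "doc_count": 1},
    {"key": "mallory", "doc_count": 1},
]

result = summarize_buckets(sample_buckets)
print(result)
```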
You can count duplicates with a scripted_metric aggregation. A similar solution is explained in the article "Accurate Distinct Count and Values from Elasticsearch". All you need to do is modify that article's query so that it counts the occurrences of each unique value instead of counting the unique values themselves.
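A rough sketch of what such a request might look like, written as a Python dict so it can be passed to whatever client you use. This is not the query from the article, and the Painless scripts are untested: they assume myfield is mapped as a keyword, and the aggregation name `duplicate_summary` is made up. The `simulate` function just mirrors the map/combine/reduce phases locally so the counting logic can be checked without a cluster.

```python
import json
from collections import Counter

# Painless scripts for the four scripted_metric phases (untested sketch).
# Assumption: myfield is a keyword field, so doc['myfield'] is readable.
INIT = "state.counts = new HashMap();"
MAP = (
    "if (doc['myfield'].size() != 0) {"
    " def v = doc['myfield'].value;"
    " state.counts.put(v, state.counts.getOrDefault(v, 0) + 1);"
    " }"
)
COMBINE = "return state.counts;"
REDUCE = (
    "Map total = new HashMap();"
    " for (def s : states) { if (s != null) {"
    " for (def e : s.entrySet()) {"
    " total.put(e.getKey(), total.getOrDefault(e.getKey(), 0) + e.getValue());"
    " } } }"
    " long once = 0; long more = 0;"
    " for (def c : total.values()) { if (c == 1) { once++; } else { more++; } }"
    " return ['number_of_values_that_appear_once': once,"
    " 'number_of_values_that_appear_twice_or_more': more];"
)

# Request body: size 0 because only the aggregation result is needed.
body = {
    "size": 0,
    "aggs": {
        "duplicate_summary": {  # hypothetical aggregation name
            "scripted_metric": {
                "init_script": INIT,
                "map_script": MAP,
                "combine_script": COMBINE,
                "reduce_script": REDUCE,
            }
        }
    },
}


def simulate(shards):
    """Mirror the phases locally: per-shard counts, then a merged summary."""
    states = [Counter(values) for values in shards]  # map + combine per shard
    total = Counter()
    for state in states:  # reduce: merge the per-shard maps
        total.update(state)
    once = sum(1 for c in total.values() if c == 1)
    more = sum(1 for c in total.values() if c >= 2)
    return {
        "number_of_values_that_appear_once": once,
        "number_of_values_that_appear_twice_or_more": more,
    }


# The five sample documents, split across two imaginary shards.
summary = simulate([["bob", "alice", "eve"], ["bob", "mallory"]])
print(json.dumps(body, indent=2))
print(summary)
```

With this approach only the two totals are returned to the client, but each shard and the coordinating node still hold a map of every distinct value in memory, so it moves the cost rather than removing it.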