
Count how often duplicates occur

In Elasticsearch, I am trying to count the number of distinct field values in the dataset where the field value:

  • Appears exactly once.
  • Appears twice or more.

In a sense, I am trying to count how often duplicates occur. How can I do this?

Example

Let's say I have the following Elasticsearch documents:

{ "myfield": "bob" }
{ "myfield": "bob" }
{ "myfield": "alice" }
{ "myfield": "eve" }
{ "myfield": "mallory" }

Since "alice", "eve" and "mallory" appear once, and "bob" appears twice, I would expect:

number_of_values_that_appear_once: 3
number_of_values_that_appear_twice_or_more: 1

I can get part of the way with a terms aggregation by looking at the doc_count of each bucket. The output of a terms aggregation on myfield would look something like:

"buckets": [
  {
    "key": "bob",
    "doc_count": 2
  },
  {
    "key": "alice",
    "doc_count": 1
  },
  ...
]

From this output, I could count the buckets where doc_count == 1, for example. But this does not scale: I often have many thousands of distinct values, so the bucket list would be enormous.
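To make the client-side approach concrete, here is a minimal Python sketch of it. The helper name `summarize_buckets` is hypothetical, and it assumes every bucket of the terms aggregation has already been fetched (by default a terms aggregation returns only the top 10 buckets), which is exactly the part that does not scale.

```python
def summarize_buckets(buckets):
    """Count distinct values that occur exactly once vs. twice or more.

    `buckets` is the "buckets" list from a terms aggregation response.
    """
    once = sum(1 for b in buckets if b["doc_count"] == 1)
    repeated = sum(1 for b in buckets if b["doc_count"] >= 2)
    return {
        "number_of_values_that_appear_once": once,
        "number_of_values_that_appear_twice_or_more": repeated,
    }


# Buckets matching the five example documents above.
buckets = [
    {"key": "bob", "doc_count": 2},
    {"key": "alice", "doc_count": 1},
    {"key": "eve", "doc_count": 1},
    {"key": "mallory", "doc_count": 1},
]
result = summarize_buckets(buckets)
print(result)
```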

dlebech asked Nov 10 '22 07:11


1 Answer

You can count duplicates with a solution based on the scripted_metric aggregation. A similar solution is explained in the article "Accurate Distinct Count and Values from Elasticsearch". All you need to do is modify the query from that article so that it counts the occurrences of each unique value instead of counting the unique values themselves.
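As a rough illustration of that idea, the Python sketch below builds a search body with a scripted_metric aggregation. It is not the query from the article and has not been run against a cluster. The function name `build_duplicate_count_query`, the aggregation name `duplicate_counts`, and the Painless scripts are assumptions, and it presumes myfield is a keyword field on a version of Elasticsearch that exposes `state` and `states` to the scripts.

```python
import json


def build_duplicate_count_query(field):
    """Build a search body that tallies occurrences per value on each shard
    and reduces the tallies to two counters."""
    # Each shard keeps a value -> occurrence-count map.
    map_script = (
        "if (doc[params.field].size() > 0) {"
        " def v = doc[params.field].value;"
        " state.counts.put(v, state.counts.getOrDefault(v, 0) + 1);"
        " }"
    )
    # Merge the per-shard maps, then count values seen once vs. more often.
    reduce_script = (
        "Map total = new HashMap();"
        " for (shard in states) {"
        " if (shard != null) {"
        " for (entry in shard.entrySet()) {"
        " total.put(entry.getKey(),"
        " total.getOrDefault(entry.getKey(), 0) + entry.getValue());"
        " } } }"
        " long once = 0; long repeated = 0;"
        " for (c in total.values()) {"
        " if (c == 1) { once++; } else { repeated++; } }"
        " return ['number_of_values_that_appear_once': once,"
        " 'number_of_values_that_appear_twice_or_more': repeated];"
    )
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "duplicate_counts": {
                "scripted_metric": {
                    "params": {"field": field},
                    "init_script": "state.counts = new HashMap();",
                    "map_script": map_script,
                    "combine_script": "return state.counts;",
                    "reduce_script": reduce_script,
                }
            }
        },
    }


body = build_duplicate_count_query("myfield")
print(json.dumps(body, indent=2))
```

The response then carries only two numbers back to the client instead of the full bucket list. Each shard still holds a map of all its distinct values in memory, so this trades response size for heap usage on the cluster.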

Pratik Patil answered Nov 15 '22 08:11