Count how often duplicates occur

Question

In Elasticsearch, I am trying to count the number of distinct field values in the dataset where the field value:

Appears exactly once.
Appears twice or more.

In a sense, I am trying to count how often duplicates occur. How can I do this?

Example

Let's say I have the following Elasticsearch documents:

{ "myfield": "bob" }
{ "myfield": "bob" }
{ "myfield": "alice" }
{ "myfield": "eve" }
{ "myfield": "mallory" }

Since "alice", "eve" and "mallory" appear once, and "bob" appears twice, I would expect:

number_of_values_that_appear_once: 3
number_of_values_that_appear_twice_or_more: 1

I can get part of the way with a terms aggregation by looking at the doc_count of each bucket. The output of a terms aggregation on myfield would look something like:

"buckets": [
  {
    "key": "bob",
    "doc_count": 2
  },
  {
    "key": "alice",
    "doc_count": 1
  },
  ...
]

From this output, I could simply count the buckets where doc_count == 1, for example. But this does not scale: I often have many thousands of distinct values, so the bucket list would be enormous.
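
For reference, the client-side counting described above might look like the following minimal sketch. The function name `tally_buckets` and the `example_buckets` list are made up for illustration; the list mimics the buckets a terms aggregation would return for the five example documents. Note that a terms aggregation only returns as many buckets as its `size` parameter allows, so this approach also requires a `size` large enough to cover every distinct value.

```python
def tally_buckets(buckets):
    """Count distinct values seen exactly once vs. twice or more.

    `buckets` is the "buckets" list from a terms aggregation response.
    """
    once = sum(1 for b in buckets if b["doc_count"] == 1)
    twice_or_more = sum(1 for b in buckets if b["doc_count"] >= 2)
    return {
        "number_of_values_that_appear_once": once,
        "number_of_values_that_appear_twice_or_more": twice_or_more,
    }


# Hypothetical buckets for the five example documents (bob twice, the rest once).
example_buckets = [
    {"key": "bob", "doc_count": 2},
    {"key": "alice", "doc_count": 1},
    {"key": "eve", "doc_count": 1},
    {"key": "mallory", "doc_count": 1},
]

result = tally_buckets(example_buckets)
print(result)
```

The whole bucket list still has to travel to the client, which is exactly the scaling problem.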

Pratik Patil · Accepted Answer

You can count duplicates with a scripted_metric aggregation. A similar solution is explained in the article "Accurate Distinct Count and Values from Elasticsearch". All you need to do is modify that article's query to count the occurrences of each unique value instead of counting the unique values themselves.
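
As a rough sketch of that idea (not the article's exact query, and not tested against a live cluster), the request body could be built as below. The aggregation name `duplicate_stats`, the result keys `once` and `twice_or_more`, and the helper names are assumptions for illustration, and `myfield` is assumed to be mapped as a `keyword` field so that doc values hold the whole string. The `simulate` function mirrors the map, combine, and reduce phases in plain Python so the logic can be checked without a cluster.

```python
from collections import Counter

FIELD = "myfield"  # assumed to be a keyword field


def build_duplicate_count_query(field=FIELD):
    """Return a search body whose scripted_metric tallies occurrences per value."""
    reduce_script = " ".join([
        "def total = new HashMap();",
        # Merge the per-shard maps so values split across shards are summed.
        "for (s in states) { if (s != null) { for (e in s.entrySet()) {",
        "total.put(e.getKey(), total.getOrDefault(e.getKey(), 0) + e.getValue());",
        "} } }",
        "int once = 0; int multi = 0;",
        "for (c in total.values()) { if (c == 1) { once++; } else { multi++; } }",
        "return ['once': once, 'twice_or_more': multi];",
    ])
    map_script = " ".join([
        "if (doc[params.field].size() > 0) {",
        "def v = doc[params.field].value;",
        "state.counts[v] = state.counts.getOrDefault(v, 0) + 1;",
        "}",
    ])
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "duplicate_stats": {
                "scripted_metric": {
                    "params": {"field": field},
                    "init_script": "state.counts = [:]",
                    "map_script": map_script,
                    "combine_script": "return state.counts",
                    "reduce_script": reduce_script,
                }
            }
        },
    }


def simulate(shards, field=FIELD):
    """Plain-Python mirror of the phases; `shards` is a list of document lists."""
    # map + combine: one value -> count table per shard
    states = [Counter(d[field] for d in shard if field in d) for shard in shards]
    # reduce: merge the shard tables, then classify each distinct value
    total = Counter()
    for state in states:
        total.update(state)
    once = sum(1 for c in total.values() if c == 1)
    return {"once": once, "twice_or_more": len(total) - once}


# The five example documents, split over two hypothetical shards.
shards = [
    [{"myfield": "bob"}, {"myfield": "alice"}, {"myfield": "eve"}],
    [{"myfield": "bob"}, {"myfield": "mallory"}],
]
query = build_duplicate_count_query()
summary = simulate(shards)
print(summary)
```

Only the two totals come back to the client, but each shard and the coordinating node still hold one map entry per distinct value in memory, so very high-cardinality fields remain expensive.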
