I am always getting a high value for the doc_count_error_upper_bound attribute on an aggregation query in Elasticsearch. It's sometimes as high as 8000 or 9000 on an ES cluster with almost a billion documents indexed. When I run the query on an index of about 5M docs, the value is around 300 to 500.
The question is: how incorrect are my results? (I am trying a top-20 count query based on the JSON below.)
"aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }
If you set the show_term_doc_count_error parameter to true, the terms aggregation will include a doc_count_error_upper_bound for each returned term, which is an upper bound on the error in that term's doc_count caused by shards that did not return the term.
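For reference, a request along these lines enables it (the index name my_index is just a placeholder; the creator field and the top-20 size come from the question):

    POST /my_index/_search
    {
      "size": 0,
      "aggs": {
        "group_by_creator": {
          "terms": {
            "field": "creator",
            "size": 20,
            "show_term_doc_count_error": true
          }
        }
      }
    }

The top-level "size": 0 just suppresses the search hits so that only the aggregation comes back.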
Elasticsearch Aggregations provide you with the ability to group and perform calculations and statistics (such as sums and averages) on your data by using a simple search query. An aggregation can be viewed as a working unit that builds analytical information across a set of documents.
Bucket aggregations don't calculate metrics over fields like the metrics aggregations do, but instead, they create buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context "falls" into it.
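As a quick sketch of the idea (the doc_length field here is invented purely for illustration), a range aggregation puts each document into the bucket whose range criterion it matches:

    POST /my_index/_search
    {
      "size": 0,
      "aggs": {
        "by_length": {
          "range": {
            "field": "doc_length",
            "ranges": [
              { "to": 100 },
              { "from": 100, "to": 1000 },
              { "from": 1000 }
            ]
          }
        }
      }
    }

The terms aggregation from the question is also a bucket aggregation: its criterion is simply "has this exact value of creator".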
Why this happens is pretty well explained in the official documentation.
When running a terms aggregation, each shard works out its own top-20 list of terms and returns it. The coordinating node then gathers all those lists and reorders them to get the overall top-20 terms across all the shards.
If you have more than one shard, it's no surprise that there might be a non-zero error count, as shown in the official doc example, and there is a way to compute that doc count error.
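As a made-up illustration of where that number comes from: suppose the index has three shards and each shard returns its own top-20 creator terms. If the 20th (smallest) term returned by the three shards has a doc_count of 100, 120 and 80 respectively, then a term that just missed the cut on every shard could still have up to 100 + 120 + 80 = 300 documents, so the aggregation-level doc_count_error_upper_bound would be 300.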
With one shard per index, the doc count error will always be zero, but a single shard might not always be feasible depending on your index topology, especially with almost one billion documents. For your index with 5M docs, though, if the documents are not too big, they could well fit in a single shard. It depends a lot on your hardware, of course, but as long as the shard size doesn't exceed 15-20GB you should be fine. Try creating a new index with a single shard and see how it goes.
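A minimal sketch of that experiment, assuming a hypothetical target index called creator_single and using my_index as a stand-in for your existing index:

    PUT /creator_single
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }

    POST /_reindex
    {
      "source": { "index": "my_index" },
      "dest": { "index": "creator_single" }
    }

With a single shard there is no cross-shard merge step, so doc_count_error_upper_bound should come back as 0.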
I created this visualisation to try and understand it myself.
There are two levels of aggregation errors:
The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms.
and
The second shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.
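Continuing the invented three-shard example from above: say the term alice is returned by shards 1 and 2 with doc counts 50 and 40, but not by shard 3, whose last (20th) returned term has a doc count of 30. The aggregation then reports alice with doc_count 50 + 40 = 90 and a term-level doc_count_error_upper_bound of 30, because up to 30 more alice documents could be sitting on shard 3 without having made that shard's top 20.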
You can see the term-level error by setting:
"show_term_doc_count_error": true
The whole-aggregation error is shown by default.
The block quotes above are from the official docs.
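To tie the two levels together, a response to a terms aggregation with show_term_doc_count_error enabled contains something along these lines; the numbers here are just the invented ones from the examples above:

    "aggregations": {
      "group_by_creator": {
        "doc_count_error_upper_bound": 300,
        "sum_other_doc_count": 123456,
        "buckets": [
          {
            "key": "alice",
            "doc_count": 90,
            "doc_count_error_upper_bound": 30
          },
          ...
        ]
      }
    }

The outer doc_count_error_upper_bound is the whole-aggregation error; the one inside each bucket is the term-level error.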