Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the significance of doc_count_error_upper_bound in elasticsearch and how can it be minimized?

I am always getting a high value for an aggregation query in elasticsearch on the doc_count_error_upper_bound attribute. It's sometimes as high as 8000 or 9000 for a ES cluster having almost a billion documents indexed. I run the query on an index of about 5M doc and I get the value to be about 300 to 500.

The question is how incorrect are my results (I am trying a top 20 count query based on the JSON below)

"aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }
like image 532
thegrid Avatar asked May 29 '16 18:05

thegrid


People also ask

What is Doc_count_error_upper_bound?

If you set the show_term_doc_count_error parameter to true , the terms aggregation will include doc_count_error_upper_bound , which is an upper bound to the error on the doc_count returned by each shard.

How do Elasticsearch aggregations work?

Elasticsearch Aggregations provide you with the ability to group and perform calculations and statistics (such as sums and averages) on your data by using a simple search query. An aggregation can be viewed as a working unit that builds analytical information across a set of documents.

What is Bucket aggregation in Elasticsearch?

Bucket aggregations don't calculate metrics over fields like the metrics aggregations do, but instead, they create buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context "falls" into it.


2 Answers

This is pretty well explained in the official documentation.

When running a terms aggregation, each shard will figure out its own top-20 list of terms and will then return their 20 top terms. The coordinating node will gather all those terms and reorder them to get the overall top-20 terms for all the shards.

If you have more than one shard, it's no surprise that there might be a non-zero error count as shown in the official doc example and there's a way to compute the doc count error.

With one shard per index, the doc error count will always be zero, but it might not always be feasible depending on your index topology, especially if you have almost one billion documents. But for your index with 5M docs, if they are not to big, they could well be stored in a single shard. Of course, it depends a lot on your hardware, but if your shard size doesn't exceed 15/20GB, you should be fine. You should try to create a new index with a single shard and see how it goes.

like image 133
Val Avatar answered Oct 21 '22 00:10

Val


I created this visualisation to try and understand it myself.

Example of elastic aggregation errors

There are two levels of aggregation errors:

  • Whole Aggregation - shows you the potential value of a missing term
  • Term Level - indicates the potential inaccuracy in a returned term

The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms.

and

The second shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

You can see the term level error by setting:

"show_term_doc_count_error": true

While the Whole Aggregation Error is shown by default

Quotes from official docs

like image 27
whatapalaver Avatar answered Oct 21 '22 00:10

whatapalaver