What is the significance of doc_count_error_upper_bound in elasticsearch and how can it be minimized?

Tags:

elasticsearch

I am always getting a high value for an aggregation query in elasticsearch on the doc_count_error_upper_bound attribute. It's sometimes as high as 8000 or 9000 for a ES cluster having almost a billion documents indexed. I run the query on an index of about 5M doc and I get the value to be about 300 to 500.

The question is how incorrect are my results (I am trying a top 20 count query based on the JSON below)

"aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }

532

asked May 29 '16 18:05

thegrid

2 Answers

This is pretty well explained in the official documentation.

When running a terms aggregation, each shard will figure out its own top-20 list of terms and will then return their 20 top terms. The coordinating node will gather all those terms and reorder them to get the overall top-20 terms for all the shards.

If you have more than one shard, it's no surprise that there might be a non-zero error count as shown in the official doc example and there's a way to compute the doc count error.

With one shard per index, the doc error count will always be zero, but it might not always be feasible depending on your index topology, especially if you have almost one billion documents. But for your index with 5M docs, if they are not to big, they could well be stored in a single shard. Of course, it depends a lot on your hardware, but if your shard size doesn't exceed 15/20GB, you should be fine. You should try to create a new index with a single shard and see how it goes.

133

answered Oct 21 '22 00:10

Val

I created this visualisation to try and understand it myself.

Example of elastic aggregation errors

There are two levels of aggregation errors:

Whole Aggregation - shows you the potential value of a missing term
Term Level - indicates the potential inaccuracy in a returned term

The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms.

and

The second shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

You can see the term level error by setting:

"show_term_doc_count_error": true

While the Whole Aggregation Error is shown by default

Quotes from official docs

answered Oct 21 '22 00:10

whatapalaver

Related questions
                            
                                ElasticSearch NEST Search Multiple Types & All Fields
                            
                                How to configure Jaeger with elasticsearch?
                            
                                ElasticSearch default scoring mechanism
                            
                                elasticsearch: auto node discovery not happening, missing anything?
                            
                                Spring Data Elasticsearch's @Field annotation not working
                            
                                Elastic search - search_after parameter
                            
                                Elasticsearch : when to set omit_norms option as false
                            
                                Elasticsearch aggregation order by top hit score
                            
                                Change ID in elasticsearch
                            
                                Elasticsearch: No handler for type [keyword] declared on field [hostname]
                            
                                Example of Angular and Elasticsearch
                            
                                Elasticsearch Java API - building queries
                            
                                How to Get All Results from Elasticsearch in Python
                            
                                Running Elastic without the Trial License
                            
                                Integration test elastic search, timing issue, document not found
                            
                                Elasticsearch char_filter replace any character with whitespace?
                            
                                Is ImmutableSettings removed in Elastic-search?
                            
                                "No valid url specified" when trying to install Kibana's Sense plugin
                            
                                Create index in Elastic Search by Java API
                            
                                Elasticsearch: Return only nested inner_hits

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With