Kibana, filter on count greater than or equal to X

Tags:

kibana

I'm using Kibana to visualize some (Elasticsearch) data but I'd like to filter out all the results with "Count" less than 1000 (X).

I am using an Y-axis with a "count Aggregation", this is the count I'd like to filter on. I tried adding in a min_document_count as suggested by several online resources but this didn't change anything. Any help would be greatly appreciated.

My entire Kibana "data" tab:

Kibana settings

512

asked Oct 25 '16 13:10

Rick van Lieshout

1 Answers

Using min_doc_count with order: ascending does not work as you would except.

TL;DR: Increasing shard_size and/or shard_min_doc_count should do the trick.

Why the aggregation is empty

As stated by the documentation:

The min_doc_count criterion is only applied after merging local terms statistics of all shards.

This mean that when you use a terms aggregations with the parameters size and min_doc_count and descending order, Elasticsearch retrieve the size less frequent terms in your data set and filter this list to keep only the terms with doc_count>min_doc_count.

If you want an example, given this dataset:

terms | doc_count
----------------
lorem | 3315
ipsum | 2487
olor  | 1484
sit   | 1057
amet  | 875
conse | 684
adip  | 124
elit  | 86

If you perform the aggregation with size=3 and min_doc_count=100 Elasticsearch will first compute the 3 less frequents terms:

conse: 684
adip : 124
elit : 86

and then filter for doc_count>100, so the final result would be:

conse: 684
adip : 124

Even though you would expect "amet" (doc_count=875) to appear in the list. Elasticsearch loose this field while computing the result and cannot retrieve it at the end.

If your case, you have so many terms with doc_count<1000 that they fill your list and then, after the filtering phase, you have no results.

Why is Elasticsearch behaving like this?

Everybody would like to apply a filter and then sort the results. We are able to do that with older datastore and it was nice. But Elasticsearch is designed to scale, so it turn off by default some of the magic that was used before.

Why? Because with large datasets it would break.

For instance, imagine that you have 800,000 different terms in your index, data is distributed over different shards (by default 4), that can be distributed other machine (at most 1 machine per shard).

When requesting terms with doc_count>1000, each machine has to compute several hundreds of thousands of counters (more than 200,000 since some occurrence of a term can be in one shard, others in another, etc). And since even if a shard saw a result only once, it may have been seen 999 times by the other shards, it cannot drop the information before merging the result. So we need to send more than 1 million counters over the network. So it is quite heavy, especially if it is done often.

So, by default, Elasticsearch will:

Compute doc_count for each term in each shard.
Not apply a filter on doc_count on a shard (loss in terms of speed and resource usage but better for accuracy): No shard_min_doc_count.
Send the size * 1.5 + 10 (shard_size) terms to a node. It will be the less frequent terms if order is ascending, most frequent terms otherwise.
Merge the counters in this node.
Apply the min_doc_count filter.
Return the size most/less frequent results.

Could it be simple for once?

Yes, sure, I said that this behavior was by default. If you do not have a huge dataset you can tune those parameters :)

Solution

If you are not OK with some loss of accuracy:

Increase the shard_size parameter to be greater than [your number of terms with a doc_count below your threshold] + [the number of values you want if you want exact results].

If you want all the results with doc_count>=1000, set it to the cardinality of the field (number of different terms), but then I do not see the point of order: ascending.

It has a massive memory impact if you have many terms, and a network if you have multiple ES nodes.

If you are OK with some loss of accuracy (often minor)

Set shard_size between this sum and [the number of values you want if you want exact results]. It is useful if you want more speed or if you do not have enough RAM to perform the exact computation. The good value for this one depends of your dataset.
Use the shard_min_doc_count parameter of the term aggregation to partially pre-filter the less frequent values. It is an efficient way to filter your data, especially if they are randomly distributed between your shards (default) and/or you do not have a lot of shards.

You can also put your data in one shard. There is no loss in term of accuracy but it is bad for performance and scaling. Yet you may not need the full power of ES if you have a small dataset.

NB: Descending order for terms aggregations is deprecated (because it cost a lot in terms of time and hardware to be accurate), it will most likely be removed in the future.

PS: You should add the Elasticsearch request generated by Kibana, it is often useful when Kibana is returning data but not the ones you want? You can find it in the "Request" tab when you click on the arrow that should be below your graph in your screenshot (ex: http://imgur.com/a/dMCWE).

101

answered Sep 21 '22 09:09

Pandawan

Related questions
                            
                                Is it better to store nested data or use flat structure with unique names in JSON?
                            
                                Difference between Weight and boost in Elasticsearch
                            
                                Elasticsearch - Want to sort by field in all indices where that particular field available or not if not then avoid it
                            
                                Elastic search Query terms and scoring
                            
                                logstash file input configuration
                            
                                Elasticsearch strange behaviour for queries straight after insertion
                            
                                Determining which words were matched in a fuzzy search
                            
                                Elasticsearch query with nested aggregations causing out of memory
                            
                                Elasticsearch NEST - Filtering on multilevel nested types
                            
                                Best way to index arbitrary attribute value pairs on elastic search
                            
                                How to prevent Elasticsearch from index throttling?
                            
                                Nest: how are you supposed to deal with the highlights in c#
                            
                                Elasticsearch asciifolding not working properly
                            
                                Elasticsearch: How to store term vectors
                            
                                Script-based sorting on Elasticsearch date field
                            
                                elasticsearch filter by length of a string field
                            
                                Elastic bulk error: failed to parse
                            
                                Scope Elasticsearch Results to Specific Ids
                            
                                No query registered for []
                            
                                Port issues with Vagrant and Elasticsearch

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With