Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Kibana, filter on count greater than or equal to X

I'm using Kibana to visualize some (Elasticsearch) data but I'd like to filter out all the results with "Count" less than 1000 (X).

I am using an Y-axis with a "count Aggregation", this is the count I'd like to filter on. I tried adding in a min_document_count as suggested by several online resources but this didn't change anything. Any help would be greatly appreciated.

My entire Kibana "data" tab:

Kibana settings

like image 512
Rick van Lieshout Avatar asked Oct 25 '16 13:10

Rick van Lieshout


People also ask

What is KQL in Kibana?

The Kibana Query Language (KQL) is a simple syntax for filtering Elasticsearch data using free text search or field-based search. KQL is only used for filtering data, and has no role in sorting or aggregating the data. KQL is able to suggest field names, values, and operators as you type.

How do I filter logs in Kibana dashboard?

Use the Logs app in Kibana to explore and filter your logs in real time. You can customize the output to focus on the data you want to see and to control how you see it. You can also view related application traces or uptime information where available.

How do I query data in Kibana?

Search your dataeditTo search all fields, enter a simple string in the query bar. To search particular fields and build more complex queries, use the Kibana Query language. As you type, KQL prompts you with the fields you can search and the operators you can use to build a structured query.


1 Answers

Using min_doc_count with order: ascending does not work as you would except.

TL;DR: Increasing shard_size and/or shard_min_doc_count should do the trick.


Why the aggregation is empty

As stated by the documentation:

The min_doc_count criterion is only applied after merging local terms statistics of all shards.

This mean that when you use a terms aggregations with the parameters size and min_doc_count and descending order, Elasticsearch retrieve the size less frequent terms in your data set and filter this list to keep only the terms with doc_count>min_doc_count.

If you want an example, given this dataset:

terms | doc_count
----------------
lorem | 3315
ipsum | 2487
olor  | 1484
sit   | 1057
amet  | 875
conse | 684
adip  | 124
elit  | 86

If you perform the aggregation with size=3 and min_doc_count=100 Elasticsearch will first compute the 3 less frequents terms:

conse: 684
adip : 124
elit : 86

and then filter for doc_count>100, so the final result would be:

conse: 684
adip : 124

Even though you would expect "amet" (doc_count=875) to appear in the list. Elasticsearch loose this field while computing the result and cannot retrieve it at the end.

If your case, you have so many terms with doc_count<1000 that they fill your list and then, after the filtering phase, you have no results.

Why is Elasticsearch behaving like this?

Everybody would like to apply a filter and then sort the results. We are able to do that with older datastore and it was nice. But Elasticsearch is designed to scale, so it turn off by default some of the magic that was used before.

Why? Because with large datasets it would break.

For instance, imagine that you have 800,000 different terms in your index, data is distributed over different shards (by default 4), that can be distributed other machine (at most 1 machine per shard).

When requesting terms with doc_count>1000, each machine has to compute several hundreds of thousands of counters (more than 200,000 since some occurrence of a term can be in one shard, others in another, etc). And since even if a shard saw a result only once, it may have been seen 999 times by the other shards, it cannot drop the information before merging the result. So we need to send more than 1 million counters over the network. So it is quite heavy, especially if it is done often.

So, by default, Elasticsearch will:

  1. Compute doc_count for each term in each shard.
  2. Not apply a filter on doc_count on a shard (loss in terms of speed and resource usage but better for accuracy): No shard_min_doc_count.
  3. Send the size * 1.5 + 10 (shard_size) terms to a node. It will be the less frequent terms if order is ascending, most frequent terms otherwise.
  4. Merge the counters in this node.
  5. Apply the min_doc_count filter.
  6. Return the size most/less frequent results.

Could it be simple for once?

Yes, sure, I said that this behavior was by default. If you do not have a huge dataset you can tune those parameters :)

Solution

If you are not OK with some loss of accuracy:

  • Increase the shard_size parameter to be greater than [your number of terms with a doc_count below your threshold] + [the number of values you want if you want exact results].

    If you want all the results with doc_count>=1000, set it to the cardinality of the field (number of different terms), but then I do not see the point of order: ascending.

    It has a massive memory impact if you have many terms, and a network if you have multiple ES nodes.

If you are OK with some loss of accuracy (often minor)

  • Set shard_size between this sum and [the number of values you want if you want exact results]. It is useful if you want more speed or if you do not have enough RAM to perform the exact computation. The good value for this one depends of your dataset.

  • Use the shard_min_doc_count parameter of the term aggregation to partially pre-filter the less frequent values. It is an efficient way to filter your data, especially if they are randomly distributed between your shards (default) and/or you do not have a lot of shards.

You can also put your data in one shard. There is no loss in term of accuracy but it is bad for performance and scaling. Yet you may not need the full power of ES if you have a small dataset.

NB: Descending order for terms aggregations is deprecated (because it cost a lot in terms of time and hardware to be accurate), it will most likely be removed in the future.

PS: You should add the Elasticsearch request generated by Kibana, it is often useful when Kibana is returning data but not the ones you want? You can find it in the "Request" tab when you click on the arrow that should be below your graph in your screenshot (ex: http://imgur.com/a/dMCWE).

like image 101
Pandawan Avatar answered Sep 21 '22 09:09

Pandawan