How do I compute facets/aggregations for the top n documents, with pagination in Elasticsearch?

Tags:

pagination

elasticsearch

faceted-search

Suppose I have an index for cars on a dealer's car lot. Each document resembles the following:

{
  color: 'red',
  model_year: '2015',
  date_added: '2015-07-20'
}

Suppose I have a million cars.

Suppose I want to present a view of the most recently added 1000 cars, along with facets over those 1000 cars.

I could just use from and size to paginate the results up to a fixed limit of 1000, but in doing so the totals and facets on model_year and color (i.e. aggregations) I get back from Elasticsearch aren't right--they're over the entire matched set.

How do I limit my search to the most recently added 1000 documents for pagination and aggregation?

asked Jul 21 '15 12:07

Michael Haren

1 Answer

As you probably saw in the documentation, aggregations are scoped to the query itself. If no query is given, they run over a match_all set of results. Setting size at the query level does not help either, because size only controls how many of the matched documents are returned. Aggregations operate on everything the query matches.
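To make the scoping concrete, here is a minimal sketch of the kind of request the question describes, built as a Python dict so it can be inspected without a cluster. The helper name `paged_search_body` and the aggregation names `by_color` and `by_model_year` are illustrative assumptions, not part of the original post.

```python
import json


def paged_search_body(page, page_size=50):
    """Build a search body for one page of the newest cars (hypothetical helper)."""
    return {
        # No restricting query: every car in the index is in scope.
        "query": {"match_all": {}},
        # Newest cars first.
        "sort": [{"date_added": {"order": "desc"}}],
        # from/size only choose which hits come back in this response.
        "from": page * page_size,
        "size": page_size,
        # Aggregations are not limited by from/size; they cover all matches.
        "aggs": {
            "by_color": {"terms": {"field": "color"}},
            "by_model_year": {"terms": {"field": "model_year"}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(paged_search_body(page=2), indent=2))
```

Stopping pagination at page 19 (with 50 hits per page) caps the hits at 1000, but the bucket counts in `by_color` and `by_model_year` would still reflect all million cars.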

This feature request is not new; it has been asked for before.

In 1.7 there is no straightforward solution. You could try the limit filter or the terminate_after in-body request parameter, but neither takes the sort order into account. You get the first terminate_after documents that matched the query, counted per shard, and the cutoff is applied before sorting rather than after it.
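As a rough illustration of the terminate_after workaround, the sketch below builds a request body with a per-shard cutoff. The helper name `terminate_after_body`, the aggregation names, and the figure of 5 primary shards are assumptions made for the example.

```python
import json


def terminate_after_body(per_shard_limit):
    """Build a body that stops collecting after N docs on each shard (hypothetical helper)."""
    return {
        "query": {"match_all": {}},
        # Per-shard cutoff: each shard stops after collecting this many
        # matching documents, regardless of their date_added values.
        "terminate_after": per_shard_limit,
        "aggs": {
            "by_color": {"terms": {"field": "color"}},
            "by_model_year": {"terms": {"field": "model_year"}},
        },
    }


if __name__ == "__main__":
    # Assumption: the index has 5 primary shards, so 1000 / 5 = 200 per shard.
    print(json.dumps(terminate_after_body(200), indent=2))
```

The aggregated documents here are whichever ones each shard happened to collect first, not the 1000 most recently added cars.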

ES 2.0 also has the sampler aggregation, which works more or less like terminate_after but uses document scores to decide which documents to take from each shard. If you only sort by date_added and the query is just a match_all, all documents have the same score, so the sampled set will not be the one you are after.
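For comparison, a sampler-based request could look roughly like the sketch below. The `sampler` aggregation and its `shard_size` setting come from Elasticsearch 2.0; the helper name `sampler_body`, the aggregation names, and the 5-shard assumption are illustrative only.

```python
import json


def sampler_body(shard_size):
    """Build a body whose facets are computed inside a sampler aggregation (hypothetical helper)."""
    return {
        "query": {"match_all": {}},
        "aggs": {
            "sample": {
                # Keeps only the top-scoring shard_size documents per shard.
                "sampler": {"shard_size": shard_size},
                # Sub-aggregations see only the sampled documents.
                "aggs": {
                    "by_color": {"terms": {"field": "color"}},
                    "by_model_year": {"terms": {"field": "model_year"}},
                },
            }
        },
    }


if __name__ == "__main__":
    # Assumption: 5 primary shards, so 200 per shard approximates 1000 in total.
    print(json.dumps(sampler_body(200), indent=2))
```

Because a match_all query scores every document the same, the sample is not selected by date_added, which is the limitation described above.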

In conclusion:

- There is no good solution for this, only workarounds based on a per-shard document count. If you want 1000 cars, divide that number by the number of primary shards and use the result in a sampler aggregation or with terminate_after to get a set of documents.
- My suggestion is to use a query that limits the set of documents (cars) by a different criterion instead. For example, show (and aggregate) the cars added in the last 30 days or something similar. In other words, put the criterion in the query itself, so that the documents it matches are exactly the ones you want aggregated. Applying aggregations to a fixed number of documents after they have been sorted is not easy.
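Both conclusions can be sketched in a few lines. The helper names `per_shard_limit` and `recent_cars_body` are hypothetical, rounding up in the division is a choice made for this example, and the date math expression `now-30d/d` is one way to express "the last 30 days".

```python
import json
import math


def per_shard_limit(total_docs, primary_shards):
    """Split a target document count across primary shards, rounding up."""
    return math.ceil(total_docs / primary_shards)


def recent_cars_body(days=30, page=0, page_size=50):
    """Build a body whose query itself restricts the set to recently added cars."""
    return {
        # The restriction lives in the query, so hits, totals, and
        # aggregations all cover the same set of documents.
        "query": {"range": {"date_added": {"gte": "now-{}d/d".format(days)}}},
        "sort": [{"date_added": {"order": "desc"}}],
        "from": page * page_size,
        "size": page_size,
        "aggs": {
            "by_color": {"terms": {"field": "color"}},
            "by_model_year": {"terms": {"field": "model_year"}},
        },
    }


if __name__ == "__main__":
    print(per_shard_limit(1000, 5))
    print(json.dumps(recent_cars_body(), indent=2))
```

With the range query, the number of cars shown depends on how many were added in the window rather than being exactly 1000, but the facets and totals agree with the paginated results.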

answered Oct 09 '22 19:10

Andrei Stefan
