
Aggregation on top N results

Problem:

If I search for "iphone", I get 400 product results, and my product category aggregation returns the top 3 categories in the result set.

Those categories would include smartphones, phone cases and mobile phone accessories.

If I search for "iphone 6", I get 1400 results because the extra "6" matches more products. The product category aggregation now returns the top 3 categories across all of those results.

The top 3 product categories now range from cables to computer monitors.

What I need to do is get the top 3 categories for the top 100 results.
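
To pin down the result I am after, here is a minimal client-side sketch in Python. The `hits` list, its `score` key and the helper name `top_categories` are made up for illustration; in practice the counting should happen inside Elasticsearch.

```python
from collections import Counter

def top_categories(hits, top_n_hits=100, top_n_categories=3):
    """Count categories among the best-scoring hits only."""
    # Sort by relevance score, highest first, and keep the top slice.
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_n_hits]
    counts = Counter(h["product_category"] for h in best)
    return counts.most_common(top_n_categories)

# Hypothetical hits standing in for search results.
hits = [
    {"score": 9.1, "product_category": "smartphones"},
    {"score": 8.7, "product_category": "phone cases"},
    {"score": 8.5, "product_category": "smartphones"},
    {"score": 1.2, "product_category": "cables"},
    {"score": 1.1, "product_category": "cables"},
    {"score": 1.0, "product_category": "cables"},
]

# Only the 3 best hits are counted, so the low-scoring "cables" hits
# do not influence the category ranking.
print(top_categories(hits, top_n_hits=3, top_n_categories=2))
# [('smartphones', 2), ('phone cases', 1)]
```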


What I've tried:

I've tried using the top_hits aggregation within the top category aggregation but that only returns the top products in each category.

Something like this:

{
    "aggs": {
        "product_categories": {
            "terms": {
                "field": "product_category",
                "size": 10
            },
            "aggs": {
                "top-categories": {
                    "top_hits": {
                        "size": 3
                    }
                }
            }
        }
    }
}

I've also tried creating a top_hits aggregation with a sub-aggregation within to get the top categories but that doesn't work either.

{
    "aggs": {
        "top-categories": {
            "top_hits": {
                "size": 100
            },
            "aggs": {
                "product_categories": {
                    "terms": {
                        "field": "product_category",
                        "size": 3
                    }
                }
            }
        }
    }
}

Can anyone help me with this problem?

Ivar asked Mar 18 '15 15:03


3 Answers

You could try using a filter aggregation based on a limit filter, and nest your terms aggregation in it.

Be aware that the limit is applied at shard level (see the documentation).

However, this should do the job in your case, with a query like:

{
  "aggs": {
    "limit_results": {
      "filter": {
        "limit": {
          "value": 100
        }
      },
      "aggs": {
        "product_categories": {
          "terms": {
            "field": "product_category",
            "size": 10
          }
        }
      }
    }
  }
}
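
To make the shard-level caveat concrete, here is a small Python simulation. It is only a sketch: the shard layout and the helper name `limit_per_shard` are invented, and it models nothing more than the fact that each shard applies the cap independently.

```python
def limit_per_shard(shards, value):
    """Apply a per-shard document cap, as a shard-level limit would."""
    # Each shard keeps at most `value` documents, independently of the others.
    return [doc for shard in shards for doc in shard[:value]]

# Hypothetical index with 5 shards holding 300 matching documents each.
shards = [[f"shard{s}-doc{d}" for d in range(300)] for s in range(5)]

kept = limit_per_shard(shards, value=100)
print(len(kept))  # 500
```

So with 5 shards and a limit of 100, the nested terms aggregation can see up to 500 documents rather than 100.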
ThomasC answered Oct 20 '22 06:10


Before I begin, please note that this is not a perfect solution to the question. However, it could definitely ease the situation, and in one special case it actually is a perfect solution.

The solution I propose sorts the terms aggregation buckets by the score of the documents they were found in. That is, the terms are no longer ordered by frequency alone but also by document score.

Here is an example request:

{
   "query": {
       "query_string": {
           "default_field": "product_title",
           "query": "iphone 6"
       }
   },
   "aggs": {
       "product_categories": {
           "terms": {
               "field": "product_category",
               "order": [
                   { "max_score": "desc" },
                   { "_count": "desc" }
               ],
               "size": 3
           },
           "aggs": {
               "max_score": {
                   "max": {
                       "script": "_score"
                   }
               }
           }
       }
   }
}

Please note the "order" property of the terms aggregation. It specifies a path to the max_score sub-aggregation, which in turn returns the maximum of the special _score field, that is, the query score of each hit document in the bucket. It ALSO uses the frequency of each term, via the "_count" property, as the second criterion.

This request will give you the three terms in the product_category field that best combine "very frequent" and "from highly ranked documents". I cannot say more precisely how the ranking is done. I noticed in preliminary experiments that the result does not follow document scores monotonically but may "jump over" a quite highly ranked document when it only includes terms of low frequency, which might actually be what you want for your use case. The documentation for this kind of ordering is found here: http://www.elastic.co/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html

The documentation linked above also has an example of ordering by multiple criteria, but it only says "The above will sort the countries buckets based on the average height among the female population and then by their doc_count in descending order". My impression was that it could be some kind of harmonic mean or something similar. It is probably best to check for yourself whether the results of this approach are useful to you.

The special case I mentioned at the beginning is when each document has exactly one value in the requested field. In that case, you actually get the top N terms for the top N documents (N being the same in both cases) when you leave out the "_count" ordering.
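
To get a feel for this ordering, here is a small Python simulation. It is a sketch under a loud assumption: the two criteria are applied one after the other, with the document count only breaking ties on the maximum score. The sample documents and the helper name `rank_categories` are invented for illustration.

```python
def rank_categories(docs, size=3):
    """Rank category buckets by max score, then by document count."""
    buckets = {}
    for doc in docs:
        cat = doc["product_category"]
        b = buckets.setdefault(
            cat, {"key": cat, "max_score": float("-inf"), "doc_count": 0}
        )
        b["max_score"] = max(b["max_score"], doc["score"])
        b["doc_count"] += 1
    # Assumption: max_score is the primary key, doc_count only breaks ties.
    ranked = sorted(
        buckets.values(), key=lambda b: (-b["max_score"], -b["doc_count"])
    )
    return [b["key"] for b in ranked[:size]]

# Hypothetical scored documents, one category value each.
docs = [
    {"score": 9.0, "product_category": "smartphones"},
    {"score": 7.5, "product_category": "phone cases"},
    {"score": 7.5, "product_category": "accessories"},
    {"score": 7.0, "product_category": "accessories"},
    {"score": 1.0, "product_category": "cables"},
    {"score": 0.9, "product_category": "cables"},
    {"score": 0.8, "product_category": "cables"},
]

print(rank_categories(docs))
# ['smartphones', 'accessories', 'phone cases']
```

In this toy data, "cables" is the most frequent category but never makes the top 3, because its best document scores far below the others.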

khituras answered Oct 20 '22 08:10


You are looking for the sampler aggregation. I have a similar answer at "Aggregation on top n results".

{
  "aggs": {
    "bestDocs": {
      "sampler": {
        "shard_size": 100
      },
      "aggs": {
        "product_categories": {
          "terms": {
            "field": "product_category",
            "size": 3
          }
        }
      }
    }
  }
}

It takes the top 100 docs on each shard, sorted by score, and then runs the terms aggregation on just those docs.
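
If you build the request from code, a sketch like the following may help. The function name `sampled_terms_request` is hypothetical, and the `match` query on `product_title` is an assumption about your mapping; only the sampler and terms parts come from the request above.

```python
import json

def sampled_terms_request(query_text, shard_size=100, top_n=3):
    """Build a search body that aggregates categories over the best docs only."""
    return {
        # Assumed query: full-text match on a product_title field.
        "query": {"match": {"product_title": query_text}},
        "aggs": {
            "bestDocs": {
                # Keep only the top-scoring docs on each shard.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "product_categories": {
                        "terms": {"field": "product_category", "size": top_n}
                    }
                },
            }
        },
    }

body = sampled_terms_request("iphone 6")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request with whatever client you already use.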

Rahul answered Oct 20 '22 08:10