
ElasticSearch: Aggregate Over a Collected Set of Results

Let's say I have a set of... burgers...

For each burger, I have a set of images relating to each component of the burger.

Unfortunately, there isn't any consistency in the structure of these components (I didn't write it).

Here is an example of two documents:

{
    "bunsResource": {
        "image": {
            "url": "./buns_1.png",
            "who": "Sam"
        },
        "buns": [
            {
                "image": {
                    "url": "./top-bun_1.png",
                    "who": "Jim"
                }
            },
            {
                "image": {
                    "url": "./bottom-bun_1.png",
                    "who": "Sarah"
                }
            }
        ]
    },
    "pattyResource": {
        "image": {
            "url": "./patties_1.png",
            "who": "Kathy"
        },
        "patties": [
            {
                "image": {
                    "url": "./patty_1.jpg",
                    "who": "Kathy"
                }
            }
        ]
    }
},
{
    "bunsResource": {
        "image": {
            "url": "./buns_2.png",
            "who": "Jim"
        },
        "buns": [
            {
                "image": {
                    "url": "./top-bun_2.png",
                    "who": "Jim"
                }
            },
            {
                "image": {
                    "url": "./bottom-bun_2.png",
                    "who": "Kathy"
                }
            }
        ]
    },
    "pattyResource": {
        "image": {
            "url": "./patties_1.png",
            "who": "Kathy"
        },
        "patties": [
            {
                "image": {
                    "url": "./patty_1.jpg",
                    "who": "Kathy"
                }
            }
        ]
    }
}

What I need is a set of photographer / image count pairs, like this:

{
    "who": "Sam",
    "count": 1
},
{
    "who": "Jim",
    "count": 3
},
{
    "who": "Sarah",
    "count": 1
},
{
    "who": "Kathy",
    "count": 3
}

That is a UNIQUE image count, mind you!

I haven't been able to figure out how to achieve this...

I assume that I need to first resolve each burger to a unique set of url / who, then aggregate from there, but I can't figure out how to get the flattened list of url / who per burger.

WebWanderer asked Mar 04 '19 21:03



1 Answer

It depends on whether the patties and buns arrays are mapped as nested or not. If they are not, it's easy: you can run a terms aggregation with a script that gathers the who fields from everywhere in the document:

POST not-nested/_search 
{
  "size": 0,
  "aggs": {
    "script": {
      "terms": {
        "script": {
          "source": """
          def list = new ArrayList();
          list.addAll(doc['pattyResource.image.who.keyword'].values);
          list.addAll(doc['bunsResource.image.who.keyword'].values);
          list.addAll(doc['bunsResource.buns.image.who.keyword'].values);
          list.addAll(doc['pattyResource.patties.image.who.keyword'].values);
          return list;
          """
        }
      }
    }
  }
}

That will return this:

  "aggregations" : {
    "script" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jim",
          "doc_count" : 2
        },
        {
          "key" : "Kathy",
          "doc_count" : 2
        },
        {
          "key" : "Sam",
          "doc_count" : 1
        },
        {
          "key" : "Sarah",
          "doc_count" : 1
        }
      ]
    }
  }
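To get the who / count shape from the question, the buckets can be mapped on the client. This is only a minimal sketch: it assumes the parsed response is available in a variable named resp (an assumed name) and hard-codes the buckets shown above for illustration. Keep in mind that doc_count here is the number of documents (burgers) in which a photographer appears, not the number of unique images.

```javascript
// Assumption: "resp" holds the parsed search response; only the part
// used here is reproduced, with the bucket values shown above.
var resp = {
    aggregations: {
        script: {
            buckets: [
                { key: 'Jim', doc_count: 2 },
                { key: 'Kathy', doc_count: 2 },
                { key: 'Sam', doc_count: 1 },
                { key: 'Sarah', doc_count: 1 }
            ]
        }
    }
};

// Rename key/doc_count to the who/count pairs asked for in the question.
var result = resp.aggregations.script.buckets.map(function(b) {
    return { who: b.key, count: b.doc_count };
});

console.log(result);
```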

However, if they are nested, things get more complicated: you'll need some client-side work to compute the final counts, but a few aggregations simplify that work:

POST nested/_search 
{
  "size": 0,
  "aggs": {
    "bunsWho": {
      "terms": {
        "field": "bunsResource.image.who.keyword"
      }
    },
    "bunsWhoNested": {
      "nested": {
        "path": "bunsResource.buns"
      },
      "aggs": {
        "who": {
          "terms": {
            "field": "bunsResource.buns.image.who.keyword"
          }
        }
      }
    },
    "pattiesWho": {
      "terms": {
        "field": "pattyResource.image.who.keyword"
      }
    },
    "pattiesWhoNested": {
      "nested": {
        "path": "pattyResource.patties"
      },
      "aggs": {
        "who": {
          "terms": {
            "field": "pattyResource.patties.image.who.keyword"
          }
        }
      }
    }
  }
}

That will return this:

  "aggregations" : {
    "pattiesWho" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Kathy",
          "doc_count" : 2
        }
      ]
    },
    "bunsWhoNested" : {
      "doc_count" : 4,
      "who" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "Jim",
            "doc_count" : 2
          },
          {
            "key" : "Kathy",
            "doc_count" : 1
          },
          {
            "key" : "Sarah",
            "doc_count" : 1
          }
        ]
      }
    },
    "pattiesWhoNested" : {
      "doc_count" : 2,
      "who" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "Kathy",
            "doc_count" : 2
          }
        ]
      }
    },
    "bunsWho" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jim",
          "doc_count" : 1
        },
        {
          "key" : "Sam",
          "doc_count" : 1
        }
      ]
    }
  }

You can then write some client-side logic (sample code in Node.js here) that adds the numbers up:

// Running total of images per photographer
var whos = {};
var recordWho = function(who, count) {
    whos[who] = (whos[who] || 0) + count;
};

// Add up the buckets of all four aggregations
resp.aggregations.pattiesWho.buckets.forEach(function(b) {recordWho(b.key, b.doc_count)});
resp.aggregations.pattiesWhoNested.who.buckets.forEach(function(b) {recordWho(b.key, b.doc_count)});
resp.aggregations.bunsWho.buckets.forEach(function(b) {recordWho(b.key, b.doc_count)});
resp.aggregations.bunsWhoNested.who.buckets.forEach(function(b) {recordWho(b.key, b.doc_count)});

console.log(whos);

=>

{ Kathy: 5, Jim: 3, Sam: 1, Sarah: 1 }
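
Note that these totals count every image occurrence, so an image that appears in both burgers (for example Kathy's ./patties_1.png) is counted twice. If a unique image count is needed, one option is to deduplicate on the client. The following is a minimal sketch rather than a definitive implementation: it assumes the documents' _source objects have already been fetched, and the helper names collectImages and countUniqueImages are hypothetical.

```javascript
// Hypothetical helper: recursively collect every object that carries both
// a "url" and a "who" string, wherever it sits in the document.
function collectImages(node, out) {
    if (Array.isArray(node)) {
        node.forEach(function(item) { collectImages(item, out); });
    } else if (node !== null && typeof node === 'object') {
        if (typeof node.url === 'string' && typeof node.who === 'string') {
            out.push({ url: node.url, who: node.who });
        }
        Object.keys(node).forEach(function(key) { collectImages(node[key], out); });
    }
    return out;
}

// Hypothetical helper: count each distinct (who, url) pair once across all docs.
function countUniqueImages(docs) {
    var seen = new Set();
    var counts = {};
    collectImages(docs, []).forEach(function(img) {
        var id = JSON.stringify([img.who, img.url]);
        if (!seen.has(id)) {
            seen.add(id);
            counts[img.who] = (counts[img.who] || 0) + 1;
        }
    });
    return counts;
}

// The two sample documents from the question, condensed.
// With a real response this could instead be built from resp.hits.hits
// by taking the _source of each hit.
var docs = [
    {
        bunsResource: {
            image: { url: './buns_1.png', who: 'Sam' },
            buns: [
                { image: { url: './top-bun_1.png', who: 'Jim' } },
                { image: { url: './bottom-bun_1.png', who: 'Sarah' } }
            ]
        },
        pattyResource: {
            image: { url: './patties_1.png', who: 'Kathy' },
            patties: [{ image: { url: './patty_1.jpg', who: 'Kathy' } }]
        }
    },
    {
        bunsResource: {
            image: { url: './buns_2.png', who: 'Jim' },
            buns: [
                { image: { url: './top-bun_2.png', who: 'Jim' } },
                { image: { url: './bottom-bun_2.png', who: 'Kathy' } }
            ]
        },
        pattyResource: {
            image: { url: './patties_1.png', who: 'Kathy' },
            patties: [{ image: { url: './patty_1.jpg', who: 'Kathy' } }]
        }
    }
];

console.log(countUniqueImages(docs));
```

For the two sample documents as listed, this gives Sam 1, Jim 3, Sarah 1 and Kathy 3. Fetching every document is only practical for small result sets, so treat this as an illustration of the deduplication step.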
Val answered Oct 11 '22 00:10