
Conditional aggregation on multi-field in Elasticsearch

Here's an example of a document in my ES index:

{ 
    "concepts": [ 
        { 
            "type": "location",
            "entities": [ 
                { "text": "Raleigh" }, 
                { "text": "Damascus" }, 
                { "text": "Brussels" } 
            ] 
        }, 
        { 
            "type": "person", 
            "entities": [ 
                { "text": "Johnny Cash" }, 
                { "text": "Barack Obama" }, 
                { "text": "Vladimir Putin" }, 
                { "text": "John Hancock" } 
            ] 
        }, 
        { 
            "type": "organization", 
            "entities": [ 
                { "text": "WTO" }, 
                { "text": "IMF" }, 
                { "text": "United States of America" } 
            ] 
        } 
    ] 
}

I'm trying to aggregate and count the frequency of each concept entity in my set of documents for a specific concept type. Let's say I'm only interested in aggregating concept entities of type "location". My aggregation buckets are then going to be "concepts.entities.text", but I only want to aggregate them if "concepts.type" is equal to "location". Here's my attempt:

{
    "query": {
        // Whatever query
    },
    "aggs": {
        "location_concept_type": {
            "filter": {
                "term": { "concepts.type": "location" }
            },
            "aggs": {
                "entities": {
                    "terms": { "field": "concepts.entities.text" }
                }
            }
        }
    }
}

The problem is that the filter only excludes documents that have no concept entities of type "location" at all. For documents that have concept entities of type "location" and of some other type, the aggregation buckets all the concept entities, regardless of their concept type.
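To make the behaviour concrete, here is a rough Python simulation of what appears to be happening. It assumes the "concepts" field uses the default object mapping (not "nested"), so each field path is indexed as one flat, multi-valued list per document and the link between a "type" and its own "entities" is lost. The helpers flatten and filtered_terms are hypothetical names for illustration, not Elasticsearch APIs, and analysis (tokenization) is ignored.

```python
from collections import Counter

doc = {
    "concepts": [
        {"type": "location", "entities": [{"text": "Raleigh"}, {"text": "Damascus"}]},
        {"type": "person", "entities": [{"text": "Johnny Cash"}]},
    ]
}


def flatten(document):
    """Mimic flat indexing of an object array: one multi-valued list per field path."""
    concepts = document["concepts"]
    return {
        "concepts.type": [c["type"] for c in concepts],
        "concepts.entities.text": [e["text"] for c in concepts for e in c["entities"]],
    }


def filtered_terms(documents, wanted_type):
    """Simulate a filter aggregation on concepts.type wrapping a terms aggregation."""
    counts = Counter()
    for document in documents:
        flat = flatten(document)
        # The filter keeps or drops the WHOLE document, not individual concepts.
        if wanted_type in flat["concepts.type"]:
            # A terms aggregation counts each distinct value once per document.
            counts.update(set(flat["concepts.entities.text"]))
    return counts


buckets = filtered_terms([doc], "location")
print(buckets)
```

In this simulation "Johnny Cash" ends up in the buckets even though the filter asked for "location", which matches the result described above.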

I have also tried restructuring my documents as follows:

{ 
    "concepts": [ 
        { 
            "type": "location",
            "text": "Raleigh"
        },
        { 
            "type": "location",
            "text": "Damascus"
        },
        { 
            "type": "location",
            "text": "Brussels"
        }, 
        { 
            "type": "person",
            "text": "Johnny Cash"
        },
        { 
            "type": "person",
            "text": "Barack Obama"
        },
        { 
            "type": "person",
            "text": "Vladimir Putin"
        },
        { 
            "type": "person",
            "text": "John Hancock"
        }, 
        { 
            "type": "organization",
            "text": "WTO" 
        },
        { 
            "type": "organization",
            "text": "IMF" 
        },
        { 
            "type": "organization",
            "text": "United States of America" 
        }
    ] 
}

But that doesn't work either. Finally, I cannot use the concept type as the key (which I believe would solve my problem), because I also need to be able to aggregate across all concept types, and the set of concept types is potentially open-ended and changing.

Any idea of how to proceed? Thanks in advance for your help.

asked Jul 10 '14 20:07 by cwarny




1 Answer

I found a workaround, though it is something of a hack. I'm posting it as an answer, but please feel free to add a more elegant alternative. I added a property alongside "type" and "text", let's call it "text_exp", that combines the type and the text as follows:

{
    "concepts": [
        { "type": "location", "text": "Raleigh", "text_exp": "location~Raleigh" },
        //...
    ]
}
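For illustration, the extra property could be added before indexing with something like the following Python sketch. The function name add_text_exp and the sep parameter are hypothetical; the only thing taken from the answer is the "type~text" layout. It assumes the flat document structure from the question and that "~" never appears in a concept type.

```python
def add_text_exp(document, sep="~"):
    """Return a copy of the document where each concept gains a combined text_exp field."""
    return {
        **document,
        "concepts": [
            # Prefix the entity text with its type, e.g. "location~Raleigh".
            {**concept, "text_exp": f"{concept['type']}{sep}{concept['text']}"}
            for concept in document["concepts"]
        ],
    }


source = {
    "concepts": [
        {"type": "location", "text": "Raleigh"},
        {"type": "person", "text": "Johnny Cash"},
    ]
}
enriched = add_text_exp(source)
print(enriched["concepts"][0]["text_exp"])
```

For the terms aggregation to return whole values such as "location~Raleigh", the "text_exp" field would presumably need to be indexed without analysis, so that it is not split into tokens.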

Then I use a regex in the terms aggregation, as follows. Let's say I only want to aggregate entities of type "location":

{
    "query": {
        // Whatever query
    },
    "aggs": {
        "location_entities": {
            "terms": { 
                "field": "concepts.text_exp",
                "include": "location~.*"
            }
        }
    }
}

Then in the response I just split on "~" and take the right part.
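The post-processing step might look like this in Python. This is a minimal sketch: the response dictionary below is hand-written to follow the usual shape of a terms aggregation result ("aggregations", then the aggregation name, then "buckets" with "key" and "doc_count"), the counts are made up for the example, and entity_counts is a hypothetical helper name.

```python
def entity_counts(response, agg_name="location_entities", sep="~"):
    """Map each entity text (the part after the separator) to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # partition splits on the first separator only, so a "~" inside the entity
    # text itself would be preserved.
    return {b["key"].partition(sep)[2]: b["doc_count"] for b in buckets}


# Illustrative response with invented counts.
sample_response = {
    "aggregations": {
        "location_entities": {
            "buckets": [
                {"key": "location~Raleigh", "doc_count": 3},
                {"key": "location~Damascus", "doc_count": 2},
            ]
        }
    }
}

counts = entity_counts(sample_response)
print(counts)
```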

answered Oct 06 '22 15:10 by cwarny