Elasticsearch Aggregation Query with multiple excludes

I have a bunch of company data in an ES database. I am looking to pull counts of how many documents each company occurs in, but I'm having some problems with the aggregation query. I am looking to exclude terms such as "Corporation" or "Inc." Thus far I have been able to do this successfully for one term at a time as per the code below.

{
    "aggs" : {
        "companies" : {
            "terms" : {
                "field" : "Companies.name",
                "exclude" : "corporation"
            }
        }
    }
}

Which returns

"aggregations": {
    "companies": {
         "buckets": [
            {
               "key": "inc",
               "doc_count": 375
            },
            {
               "key": "company",
               "doc_count": 252
            }
         ]
     }
}

Ideally I'd like to be able to do something like

{
    "aggs" : {
        "companies" : {
            "terms" : {
                "field" : "Companies.name",
                "exclude" : ["corporation", "inc.", "inc", "co", "company", "the", "industries", "incorporated", "international"]
            }
        }
    }
}

But I haven't been able to find a syntax that doesn't throw an error.

I have looked at the "Terms" section of Aggregation in the ES documentation and can only find an example for a single exclude. I'm wondering if it's possible to exclude multiple terms and, if so, what the correct syntax is.

Note: I know I could set the field to "not_analyzed" and get groupings for full company names rather than the split names. However, I'm hesitant to do this, as analyzing allows a bucket to be more tolerant of name variations (i.e., Microsoft Corp & Microsoft Corporation).

drowningincode asked Apr 01 '14 20:04


1 Answer

The exclude parameter is a regular expression, so you could use a regular expression that exhaustively lists all choices:

"exclude" :
    "corporation|inc\\.|inc|co|company|the|industries|incorporated|international"

If you generate this pattern programmatically, be sure to escape regex metacharacters in the values (e.g., .). If you write it by hand, you can simplify it by grouping alternatives (e.g., inc\\.? covers inc\\.|inc, or the more complicated co(mpany|rporation)?, which covers co, company, and corporation). If this is going to run a lot, it's probably worth testing how the added complexity affects performance.
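
As a rough sketch of the "generated" case, the Python below joins a list of literal terms into a single alternation and escapes each one. The helper name build_exclude_pattern and the stop_terms list are made up for illustration, and re.escape follows Python's escaping rules, so treat its output as valid for ES only for simple input like plain words and dots.

```python
import json
import re


def build_exclude_pattern(terms):
    """Join literal terms into one alternation, escaping regex metacharacters."""
    # Hypothetical helper: re.escape uses Python's rules. For plain words and
    # dots the result should also be a valid Java-style pattern, but verify
    # this for any more exotic characters.
    return "|".join(re.escape(term) for term in terms)


stop_terms = ["corporation", "inc.", "inc", "co", "company", "the",
              "industries", "incorporated", "international"]

query = {
    "aggs": {
        "companies": {
            "terms": {
                "field": "Companies.name",
                "exclude": build_exclude_pattern(stop_terms),
            }
        }
    }
}

# json.dumps doubles the backslash, so the request body contains inc\\.
print(json.dumps(query, indent=4))
```

Building the body as a dict and serializing it avoids hand-writing the JSON escaping of the backslash.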

There are also optional flags that can be applied; these are the flags defined by Java's Pattern class. The one that might come in handy is CASE_INSENSITIVE.

"exclude" : {
    "pattern" : "...expression as before...",
    "flags" : "CASE_INSENSITIVE"
}
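
To get a feel for which terms such a pattern would drop, here is a small local check in Python. It is only an approximation: it uses Python's re module rather than Java's Pattern, it assumes the exclude pattern has to match the whole term (hence fullmatch), and the sample terms are invented.

```python
import re

# Grouped form of the pattern, using inc\.? and co(mpany|rporation)?
pattern = re.compile(
    r"inc\.?|co(mpany|rporation)?|the|industries|incorporated|international",
    re.IGNORECASE,  # stands in for the CASE_INSENSITIVE flag
)


def is_excluded(term):
    # Assumption: a term is excluded only if the pattern matches all of it.
    return pattern.fullmatch(term) is not None


sample_terms = ["Inc", "inc.", "Corporation", "co", "microsoft", "income"]
kept = [term for term in sample_terms if not is_excluded(term)]
print(kept)  # ['microsoft', 'income']
```

Under the whole-term assumption, "income" survives even though it starts with "inc".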
pickypg answered Nov 16 '22 08:11