I have a bunch of company data in an ES database. I am looking to pull counts of how many documents each company occurs in, but I'm having some problems with the aggregation query. I am looking to exclude terms such as "Corporation" or "Inc." Thus far I have been able to do this successfully for one term at a time, as per the code below.
{
"aggs" : {
"companies" : {
"terms" : {
"field" : "Companies.name",
"exclude" : "corporation"
}
}
}
}
Which returns
"aggregations": {
"companies": {
"buckets": [
{
"key": "inc",
"doc_count": 375
},
{
"key": "company",
"doc_count": 252
}
]
}
}
Ideally I'd like to be able to do something like
{
"aggs" : {
"companies" : {
"terms" : {
"field" : "Companies.name",
"exclude" : ["corporation", "inc.", "inc", "co", "company", "the", "industries", "incorporated", "international"]
}
}
}
}
But I haven't been able to find a way that doesn't throw an error.
I have looked at the "Terms" section of Aggregation in the ES documentation and can only find an example for a single exclude. I'm wondering if it's possible to exclude multiple terms and, if so, what the correct syntax for doing so is.
Note: I know I could set the field to "not_analyzed" and get groupings for full company names rather than the split names. However, I'm hesitant to do this, as analyzing allows a bucket to be more tolerant of name variations (i.e., Microsoft Corp & Microsoft Corporation).
The exclude parameter is a regular expression, so you could use a regular expression that exhaustively lists all choices:
"exclude" : "corporation|inc\\.|inc|co|company|the|industries|incorporated|international"
If you generate the pattern programmatically, it's important to escape special characters in the values (e.g., the "." in "inc."). If the pattern is written by hand, you could simplify it by grouping alternatives (e.g., inc\\.? covers inc\\.|inc, or the more complicated co(mpany|rporation)?). If this is going to run a lot, it's probably worth testing how the added complexity affects performance.
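As a rough illustration of the escaping and grouping points above, here is a minimal Python sketch. The helper names (escape_term, build_exclude_pattern) and the RESERVED character set are my own assumptions, not part of any Elasticsearch client; the set is modelled on Lucene-style regular expression syntax and may need adjusting for your version. Python's re module is used only as a stand-in to sanity-check the grouped pattern, since these simple constructs behave the same way there.

```python
import json
import re

# Assumed set of reserved regex characters (Lucene-style); adjust as needed.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_term(term: str) -> str:
    """Backslash-escape every reserved character in a literal term."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in term)


def build_exclude_pattern(terms) -> str:
    """Join the escaped literal terms into a single alternation."""
    return "|".join(escape_term(t) for t in terms)


stop_terms = ["corporation", "inc.", "inc", "co", "company"]
pattern = build_exclude_pattern(stop_terms)

print(pattern)              # the regex as the engine would see it
print(json.dumps(pattern))  # the same regex as a JSON string (backslash doubled)

# Hand-simplified form from the answer; check it covers the same terms.
grouped = r"inc\.?|co(mpany|rporation)?"
covered = all(re.fullmatch(grouped, t) for t in stop_terms)
print(covered)
```

Note that json.dumps takes care of doubling the backslash, which is why the pattern appears as inc\\. inside the request body but is inc\. once parsed.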
There are also optional flags that can be applied, which are the options that exist in Java's Pattern class. The one that might come in handy is CASE_INSENSITIVE.
"exclude" : {
"pattern" : "...expression as before...",
"flags" : "CASE_INSENSITIVE"
}
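To show how the pieces fit together, the sketch below assembles the full request body in Python and serialises it to JSON. The function terms_agg_body is a hypothetical helper of mine, and it assumes an Elasticsearch version that accepts the pattern/flags object form shown above; if yours does not, pass no flags and the plain string form is used instead.

```python
import json


def terms_agg_body(field, pattern, flags=None):
    """Build a terms aggregation body with a regex exclude (hypothetical helper)."""
    # Plain string form by default; object form only when flags are given.
    exclude = pattern if flags is None else {"pattern": pattern, "flags": flags}
    return {
        "aggs": {
            "companies": {
                "terms": {"field": field, "exclude": exclude}
            }
        }
    }


body = terms_agg_body(
    "Companies.name",
    "inc\\.?|co(mpany|rporation)?|the",  # regex text: inc\.?|co(mpany|rporation)?|the
    flags="CASE_INSENSITIVE",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the search request body with whatever client you already use.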