
Elasticsearch Query aggregated by unique substrings (email domain)

I have an Elasticsearch query that filters an index and then runs a terms aggregation on the field sender_not_analyzed, which returns buckets for the top "senders". My query is currently:

{
   "size": 0,
   "query": {
      "regexp": {
         "sender_not_analyzed": ".*[@].*"
      }
   },
   "aggs": {
      "sender-stats": {
         "terms": {
            "field": "sender_not_analyzed"
         }
      }
   }
}

which returns buckets that look like:

"aggregations": {
      "sender-stats": {
         "buckets": [
            {
               "key": "<Mike <[email protected]>@MISSING_DOMAIN>",
               "doc_count": 5017
            },
            {
               "key": "[email protected]",
               "doc_count": 3963
            },
            {
               "key": "[email protected]",
               "doc_count": 2857
            },
            {
               "key": "[email protected]",
               "doc_count": 1544
            }

How can I write an aggregation that returns a single bucket for each unique email domain, e.g. foo.com would have a doc_count of 6820 (3963 + 2857)? Can I accomplish this with a regex aggregation, or do I need to write a custom analyzer that keeps only the part of the string after the @?

idclark asked Oct 20 '22 09:10


1 Answer

This is pretty late, but I think this can be done with a pattern_replace char filter that captures the domain name with a regex. This is my setup:

PUT email_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "char_filter": [
            "domain"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "domain": {
          "type": "pattern_replace",
          "pattern": ".*@(.*)",
          "replacement": "$1"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "domain": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        },
        "sender_not_analyzed": {
          "type": "string",
          "index": "not_analyzed",
          "copy_to": "domain"
        }
      }
    }
  }
}

Here the domain char filter captures the domain name. The keyword tokenizer is needed so that the domain is kept as a single token. I am using the lowercase filter, but that is optional. The copy_to parameter copies the value of sender_not_analyzed into the domain field; the _source field is not modified to include this value, but the field can still be queried and aggregated on.
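To see what the analysis chain should produce for a given sender, here is a minimal Python sketch that approximates it outside Elasticsearch. It reuses the same pattern as the domain char filter and applies lowercasing; asciifolding is not emulated. The function name emulate_domain_analyzer and the sample senders are hypothetical, since the real addresses in the question are redacted.

```python
import re

# Same pattern as the "domain" char filter: the greedy ".*@" consumes
# everything up to the last "@", and group 1 keeps the rest of the string.
DOMAIN_PATTERN = re.compile(r".*@(.*)")


def emulate_domain_analyzer(sender: str) -> str:
    """Approximate my_custom_analyzer: char filter, keyword tokenizer, lowercase."""
    replaced = DOMAIN_PATTERN.sub(r"\1", sender)  # pattern_replace with "$1"
    return replaced.lower()  # lowercase filter; asciifolding is not emulated


if __name__ == "__main__":
    # Hypothetical sender values, not the ones from the question.
    samples = [
        "alice@foo.com",
        "Bob@FOO.com",
        "<Mike <mike@example.com>@MISSING_DOMAIN>",
        "no-at-sign",
    ]
    for sender in samples:
        print(sender, "->", emulate_domain_analyzer(sender))
```

Note that a value without an @ passes through unchanged, and a malformed value like the MISSING_DOMAIN one keeps its trailing ">", so you may want a stricter pattern if such senders matter.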

GET email_index/_search
{
  "size": 0,
  "query": {
    "regexp": {
      "sender_not_analyzed": ".*[@].*"
    }
  },
  "aggs": {
    "sender-stats": {
      "terms": {
        "field": "domain"
      }
    }
  }
}

This will give you the desired result: one bucket per email domain.
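As a rough sanity check of the counts to expect, the per-sender buckets can be merged by domain on the client side. The sketch below is only an illustration: the doc_count values are taken from the question, but the addresses (and the bar.com domain) are placeholders because the originals are redacted, and merge_by_domain is a hypothetical helper, not part of Elasticsearch.

```python
from collections import Counter

# doc_count values come from the question; the addresses are placeholders
# because the real ones are redacted in the original post.
sender_buckets = [
    ("alice@foo.com", 3963),
    ("bob@foo.com", 2857),
    ("carol@bar.com", 1544),
]


def merge_by_domain(buckets):
    """Sum doc_count per domain, taking the text after the last "@" as the domain."""
    totals = Counter()
    for key, doc_count in buckets:
        # rpartition returns the whole key as the last element when "@" is absent
        domain = key.rpartition("@")[2].lower()
        totals[domain] += doc_count
    return totals


domain_totals = merge_by_domain(sender_buckets)
print(domain_totals.most_common())  # [('foo.com', 6820), ('bar.com', 1544)]
```

The foo.com total matches the 6820 asked for in the question. Keep in mind that merging client side only sees the buckets the terms aggregation returned, whereas aggregating on the domain field counts every matching document.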

ChintanShah25 answered Oct 30 '22 23:10
