
Elasticsearch pattern_capture filter also emits tokens that do not match the pattern

I need to extract the domain part of email addresses found in a text. I use the uax_url_email tokenizer so that each email address is emitted as a single token, and a pattern_capture filter with the pattern "@(.+)" to emit the domain. But uax_url_email also returns the ordinary words that are not email addresses, and the pattern_capture filter does not remove them. Any suggestions?

"custom_analyzer": {
  "tokenizer": "uax_url_email",
  "filter": [
    "email_domain_filter"
  ]
}
"filter": {
  "email_domain_filter": {
    "type": "pattern_capture",
    "preserve_original": false,
    "patterns": [
      "@(.+)"
    ]
  }
}

Input string: "my email id is [email protected]"

Output tokens: my, email, id, is, gmail.com

But I need only gmail.com.

Rajagopal asked Sep 16 '25 04:09


1 Answer

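This is the documented behaviour of Lucene's PatternCaptureGroupTokenFilter, which implements the pattern_capture filter:

"If none of the patterns match, or if preserveOriginal is true, the original token will be preserved."

https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html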
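
To make the documented rule concrete, here is a minimal Python sketch that models it. This is not the Lucene implementation: the function name pattern_capture is hypothetical, Python's re module stands in for Java regular expressions, and "user@gmail.com" is a placeholder address, since the real one is not shown in the question.

```python
import re


def pattern_capture(tokens, patterns, preserve_original=False):
    """Simplified model of a pattern_capture token filter (hypothetical helper)."""
    compiled = [re.compile(p) for p in patterns]
    out = []
    for token in tokens:
        # Collect every non-empty capture group from every match of every pattern.
        captures = [
            group
            for regex in compiled
            for match in regex.finditer(token)
            for group in match.groups()
            if group
        ]
        # The documented rule: keep the original token when nothing was
        # captured, or when preserve_original is set.
        if preserve_original or not captures:
            out.append(token)
        out.extend(captures)
    return out


# Placeholder address; the tokens mimic what uax_url_email would emit.
tokens = ["my", "email", "id", "is", "user@gmail.com"]
print(pattern_capture(tokens, ["@(.+)"]))
# prints ['my', 'email', 'id', 'is', 'gmail.com']
```

In this model the plain words pass through unchanged because "@(.+)" never matches them, which mirrors the output tokens reported in the question.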

Try adding a pattern that matches the other tokens but does not contain a capture group (e.g. ".*").

MTH answered Sep 19 '25 07:09