ElasticSearch Analyzer and Tokenizer for Emails

I could not find a perfect solution on either Google or ES for the following situation; I hope someone can help here.

Suppose there are five documents with email addresses stored under the field "email":

1. {"email": "[email protected]"}
2. {"email": "[email protected], [email protected]"}
3. {"email": "[email protected]"}
4. {"email": "[email protected]"}
5. {"email": "[email protected]"}

I want to fulfill the following searching scenarios:

[Search -> Receive]

"[email protected]" -> 1,2

"[email protected]" -> 2,4

"[email protected]" -> 5

"john.doe" -> 1,2,3,4

"john" -> 1,2,3,4,5

"gmail.com" -> 1,2

"outlook.com" -> 2,3,4

The first three matches are a MUST; for the rest, the more precise the better. I have already tried different combinations of index/search analyzers, tokenizers, and filters, and also tried adjusting the conditions of match queries, but did not find an ideal solution. Any thoughts are welcome, and there is no restriction on the mappings, analyzers, or kind of query to use. Thanks.

LYu asked May 08 '15 04:05


1 Answer

Mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}
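To see which terms this analyzer is likely to index, the filter chain can be approximated outside Elasticsearch. The following is only a rough Python sketch, not the actual Lucene implementation: the function name `email_terms` and the sample address are hypothetical, the `uax_url_email` tokenizer is replaced by a plain split on commas and whitespace, `[^\W\d_]` stands in for `\p{L}` (which Python's `re` module does not support), and token positions are ignored.

```python
import re

# Patterns copied from the "email" pattern_capture filter in the mapping above.
# Assumption: [^\W\d_] approximates \p{L} ("any letter"), since Python's re
# module has no \p{L} class.
PATTERNS = [
    r"([^@]+)",       # local part, and the part after "@"
    r"([^\W\d_]+)",   # runs of letters
    r"(\d+)",         # runs of digits
    r"@(.+)",         # domain
    r"([^-@]+)",      # pieces between "-" and "@"
]


def email_terms(field_value):
    """Approximate the set of terms the 'email' analyzer would index."""
    terms = set()
    # Stand-in for the uax_url_email tokenizer: split on commas and
    # whitespace so that each address stays a single token.
    for token in re.split(r"[,\s]+", field_value.strip()):
        if not token:
            continue
        token = token.lower()      # "lowercase" filter
        terms.add(token)           # "preserve_original"
        for pattern in PATTERNS:
            for match in re.finditer(pattern, token):
                if match.group(1):  # skip empty captures
                    terms.add(match.group(1))
    return terms                   # a set, mirroring the "unique" filter


if __name__ == "__main__":
    # Hypothetical address, used only for illustration.
    for term in sorted(email_terms("John.Doe-99@example.com")):
        print(term)
```

Under these assumptions the full address, the local part, the domain, and the individual name pieces all end up as separate terms, which is why an exact lookup on the whole address, on "john.doe", on "john", or on the domain can each find the document. Because a term query does not analyze its input, the search text would need to be lowercase to match the lowercased terms.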

Test data:

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "[email protected]"}
{"index":{"_id":"2"}}
{"email": "[email protected], [email protected]"}
{"index":{"_id":"3"}}
{"email": "[email protected]"}
{"index":{"_id":"4"}}
{"email": "[email protected]"}
{"index":{"_id":"5"}}
{"email": "[email protected]"}

Query to be used:

GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "[email protected]"
    }
  }
}
Andrei Stefan answered Sep 18 '22 13:09