Exact-match, case-insensitive match without normalization in Elasticsearch 6.2

I have looked at every article and post I could find about performing exact-match, case-insensitive queries, but upon implementation, they do not perform what I am looking for.

Before you mark this question as a duplicate, please read the entire post.

Given a username, I want to query my Elasticsearch index and return only the document whose username matches exactly, but case-insensitively.

I have tried specifying a lowercase analyzer for my username property and using a match query to implement this behavior. While this solves the problem of case-insensitive matching, it fails at exact matching.

I looked into using a lowercase normalizer, but that would make all of my usernames lowercase before indexing, so when I aggregate the usernames, they would return in lowercase form, which is not what I want. I need to preserve the original case of each letter in the username.

What I want is the following behavior:


Inserting Users

POST {elastic}/users/_doc

{
    "email": "[email protected]",
    "username": "UsErNaMe",
    "password": "1234567"
}

This document will be stored in an index called users exactly the way it is.

Getting a User by Username

GET {frontend}/user/UsErNaMe

should return

{
    "email": "[email protected]",
    "username": "UsErNaMe",
    "password": "1234567"
}

and

GET {frontend}/user/username

should return

{
    "email": "[email protected]",
    "username": "UsErNaMe",
    "password": "1234567"
}

and

GET {frontend}/user/USERNAME

should return

{
    "email": "[email protected]",
    "username": "UsErNaMe",
    "password": "1234567"
}

and

GET {frontend}/user/UsErNaMe $RaNdoM LeTteRs

should NOT return anything.

Thank you.

asked Apr 18 '19 08:04 by Hid


1 Answer

To achieve a case-insensitive exact match, you need to define your own analyzer. The analyzer needs to do two things:

  1. Lowercase the input value (for case insensitivity).
  2. Make no other modification to the input (for exact matching).

Both can be achieved by:

  1. Using the lowercase token filter in the custom analyzer.
  2. Setting the tokenizer to keyword, which emits the entire input value as a single token; the lowercase filter is then applied to that one token.

This custom analyzer can then be applied to any text field where case-insensitive exact search is required.
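As a rough illustration of why the keyword tokenizer matters, here is a minimal Python sketch that imitates the analysis chain outside Elasticsearch. The function names are hypothetical, and whitespace_tokenizer is only a simplified stand-in for a word-splitting tokenizer, not Elasticsearch's actual standard tokenizer.

```python
def keyword_tokenizer(text):
    # Imitates the "keyword" tokenizer: the whole input becomes one token.
    return [text]


def whitespace_tokenizer(text):
    # Simplified stand-in for a word-splitting tokenizer (for contrast only).
    return text.split()


def lowercase_filter(tokens):
    # Imitates the "lowercase" token filter.
    return [token.lower() for token in tokens]


def analyze(text, tokenizer):
    # In an analyzer, the tokenizer runs first, then the token filters.
    return lowercase_filter(tokenizer(text))


print(analyze("UsErNaMe", keyword_tokenizer))
# ['username']
print(analyze("UsErNaMe $RaNdoM LeTteRs", keyword_tokenizer))
# ['username $random letters']
print(analyze("UsErNaMe $RaNdoM LeTteRs", whitespace_tokenizer))
# ['username', '$random', 'letters']
```

With a word-splitting tokenizer, the longer input still produces the token username, so a match query would find the document. With the keyword tokenizer, the whole string is one token, so only the complete username can match.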

To create the index, you can use the following:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "email": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        },
        "username": {
          "type": "text",
          "analyzer": "case_insensitive_analyzer"
        },
        "password": {
          "type": "keyword"
        }
      }
    }
  }
}

In the mapping above, case_insensitive_analyzer is the custom analyzer, and it is applied to the username field.
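One way to check the analyzer before indexing anything is the _analyze API on the index. The sketch below only builds the request (method, path, and JSON body) in Python; the helper name build_analyze_request is hypothetical, and sending the request to your cluster is left to whatever HTTP client you already use.

```python
import json


def build_analyze_request(index, analyzer, text):
    # Builds the pieces of an _analyze request for a named index analyzer.
    # Nothing is sent here; pass these to your own HTTP client.
    path = "/{}/_analyze".format(index)
    body = {"analyzer": analyzer, "text": text}
    return "GET", path, json.dumps(body)


method, path, body = build_analyze_request(
    "test", "case_insensitive_analyzer", "UsErNaMe"
)
print(method, path)
print(body)
```

If the analyzer is defined as above, the response should list a single token, username.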

Suppose you index a document as follows:

PUT test/_doc/1
{
  "email": "[email protected]",
  "username": "UsErNaMe",
  "password": "1234567"
}

For the username field, the input is UsErNaMe. The analyzer first applies the keyword tokenizer, which does nothing but emit the whole input as a single token, UsErNaMe. It then applies the lowercase filter to that token, so the indexed token is username. Only the indexed token is lowercased; the stored _source still contains UsErNaMe exactly as submitted.

Now you can use a match query, as below, to search against the username field:

GET test/_doc/_search
{
  "query": {
    "match": {
      "username": "USERNAME"
    }
  }
}

The query above gives the desired output. Replace USERNAME in the query with username, UsErNaMe, or USERname, and each will match the document. The reason is that, when no analyzer is explicitly specified at search time, Elasticsearch uses the analyzer that was applied to the field at index time. Here, when searching against the username field, case_insensitive_analyzer is applied to the input value USERNAME, which produces the token username, and hence the match.
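The search-time behavior can be sketched with a toy in-memory index in Python. This is only an illustration of the idea, not how Elasticsearch is implemented: TinyIndex and its methods are hypothetical names, and analyze imitates case_insensitive_analyzer for plain ASCII input.

```python
def analyze(text):
    # Imitates case_insensitive_analyzer: one token, lowercased.
    return [text.lower()]


class TinyIndex:
    def __init__(self):
        self.postings = {}  # analyzed token -> list of document ids
        self.sources = {}   # document id -> original document, unchanged

    def index(self, doc_id, doc):
        # Keep the original document and index only the analyzed token.
        self.sources[doc_id] = doc
        for token in analyze(doc["username"]):
            self.postings.setdefault(token, []).append(doc_id)

    def match_username(self, query_text):
        # The query text goes through the same analyzer as the field.
        hits = []
        for token in analyze(query_text):
            for doc_id in self.postings.get(token, []):
                if doc_id not in hits:
                    hits.append(doc_id)
        return [self.sources[doc_id] for doc_id in hits]


idx = TinyIndex()
idx.index(1, {"username": "UsErNaMe", "password": "1234567"})

for query in ["UsErNaMe", "username", "USERNAME", "UsErNaMe $RaNdoM LeTteRs"]:
    print(query, "->", idx.match_username(query))
```

The first three queries return the document with its username in the original case, because the original document is kept separately from the analyzed token. The last query analyzes to a different single token, so it returns nothing.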

answered Nov 11 '22 12:11 by Nishant