 

Emulate a SQL LIKE search with ElasticSearch

I'm just beginning with ElasticSearch and trying to implement an autocomplete feature based on it.

I have an autocomplete index with a field city of type string. Here's an example of a document stored into this index:

{  
   "_index":"autocomplete_1435797593949",
   "_type":"listing",
   "_id":"40716",
   "_source":{  
      "city":"Rome",
      "tags":[  
         "listings"
      ]
   }
}

The analysis configuration looks like this:

{  
   "analyzer":{  
      "autocomplete_term":{  
         "tokenizer":"autocomplete_edge",
         "filter":[  
            "lowercase"
         ]
      },
      "autocomplete_search":{  
         "tokenizer":"keyword",
         "filter":[  
            "lowercase"
         ]
      }
   },
   "tokenizer":{  
      "autocomplete_edge":{  
         "type":"nGram",
         "min_gram":1,
         "max_gram":100
      }
   }
}

The mappings:

{  
   "autocomplete_1435795884170":{  
      "mappings":{  
         "listing":{  
            "properties":{  
               "city":{  
                  "type":"string",
                  "analyzer":"autocomplete_term"
               },
            }
         }
      }
   }
}

I'm sending the following Query to ES:

{  
   "query":{  
      "multi_match":{  
         "query":"Rio",
         "analyzer":"autocomplete_search",
         "fields":[  
            "city"
         ]
      }
   }
}

As a result, I get the following:

{  
   "took":2,
   "timed_out":false,
   "_shards":{  
      "total":5,
      "successful":5,
      "failed":0
   },
   "hits":{  
      "total":1,
      "max_score":2.7742395,
      "hits":[  
         {  
            "_index":"autocomplete_1435795884170",
            "_type":"listing",
            "_id":"53581",
            "_score":2.7742395,
            "_source":{  
               "city":"Rio",
               "tags":[  
                  "listings"
               ]
            }
         }
      ]
   }
}

For the most part, it works: it finds the document with city = "Rio" before the user has typed the whole word ("Ri" is enough).

And here lies my problem. I want it to return "Rio de Janeiro", too. To get "Rio de Janeiro", I need to send the following query:

{  
   "query":{  
      "multi_match":{  
         "query":"Rio d",
         "analyzer":"standard",
         "fields":[  
            "city"
         ]
      }
   }
}

Notice the "<whitespace>d" there.

Another related problem is that I'd expect at least all cities that start with an "R" to be returned with the following query:

{  
   "query":{  
      "multi_match":{  
         "query":"R",
         "analyzer":"standard",
         "fields":[  
            "city"
         ]
      }
   }
}

I'd expect "Rome" (a document that exists in the index) to be among the results, but again I only get "Rio". I would like it to behave like the SQL LIKE condition, i.e. ... LIKE 'CityName%'.

What am I doing wrong?

asked Oct 20 '22 by FullOfCaffeine


2 Answers

I would do it like this:

  • change the tokenizer to edge_nGram since you said you need LIKE 'CityName%' (meaning a prefix match):
  "tokenizer": {
    "autocomplete_edge": {
      "type": "edge_nGram",
      "min_gram": 1,
      "max_gram": 100
    }
  }
  • have the field specify your autocomplete_search as a search_analyzer. I think keyword plus lowercase is a good choice for it:
  "mappings": {
    "listing": {
      "properties": {
        "city": {
          "type": "string",
          "index_analyzer": "autocomplete_term",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
  • and the query itself is as simple as:
{
  "query": {
    "multi_match": {
      "query": "R",
      "fields": [
        "city"
      ]
    }
  }
}

The detailed explanation goes like this: you split your city names into edge n-grams. For example, for Rio de Janeiro you'll index something like:

           "city": [
              "r",
              "ri",
              "rio",
              "rio ",
              "rio d",
              "rio de",
              "rio de ",
              "rio de j",
              "rio de ja",
              "rio de jan",
              "rio de jane",
              "rio de janei",
              "rio de janeir",
              "rio de janeiro"
           ]
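If it helps to see that list generated concretely, here is a rough Python sketch of what an edge n-gram tokenizer with min_gram 1 and a large max_gram produces after lowercasing (the function name is purely illustrative, not an Elasticsearch API):

```python
def edge_ngrams(text, min_gram=1, max_gram=100):
    """Approximate an edge_ngram tokenizer plus a lowercase filter:
    every prefix of the input, from min_gram up to max_gram chars."""
    text = text.lower()
    return [text[:n] for n in range(min_gram, min(len(text), max_gram) + 1)]

print(edge_ngrams("Rio de Janeiro"))
# ['r', 'ri', 'rio', 'rio ', ..., 'rio de janeiro']
```

Note that the grams include the trailing spaces ("rio ", "rio de "), which is exactly why a query like "rio d" can match.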

Notice that everything is lowercased. Now, you want your query to take any text (lowercase or not) and match it against what's in the index, so an R should match the list above.

For this to happen, you want the input text to be lowercased but otherwise kept exactly as the user typed it, meaning it shouldn't be split into n-grams. Why would you want this? Because you have already split the city names into n-grams at index time, and you don't want the same done to the input text. If the user inputs "RI", Elasticsearch will lowercase it to ri and match it exactly against what it has in the index.
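To make that asymmetry concrete, here is a small Python sketch (names are illustrative, not Elasticsearch APIs) applying the two analyzers: edge n-grams plus lowercase at index time, keyword plus lowercase at search time, then an exact term match between the two:

```python
def index_terms(city):
    # index-time analysis: lowercase + edge n-grams (all prefixes)
    city = city.lower()
    return {city[:n] for n in range(1, len(city) + 1)}

def search_term(query):
    # search-time analysis: keyword tokenizer + lowercase filter,
    # i.e. the whole input becomes one lowercased term
    return query.lower()

index = {city: index_terms(city) for city in ["Rio", "Rio de Janeiro", "Rome"]}

def matches(query):
    term = search_term(query)
    return sorted(city for city, terms in index.items() if term in terms)

print(matches("R"))      # ['Rio', 'Rio de Janeiro', 'Rome']
print(matches("Rio d"))  # ['Rio de Janeiro']
```

This is exactly the LIKE 'CityName%' behavior the question asks for: "R" matches every city, and "Rio d" narrows down to "Rio de Janeiro".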

A probably faster alternative to multi_match is a term filter, but this requires your application/website to lowercase the text itself, because term doesn't analyze the input text at all.

{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "city": {
            "value": "ri"
          }
        }
      }
    }
  }
}
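As a quick illustration of why the client must do the lowercasing in the term variant (a sketch only; term compares the input verbatim against the stored, already-lowercased grams):

```python
# lowercased edge n-grams stored in the index for "Rio"
indexed_grams = {"r", "ri", "rio"}

# term does no analysis, so the case must already match the stored terms
print("Ri" in indexed_grams)          # False: ES won't lowercase it for you
print("Ri".lower() in indexed_grams)  # True: the client lowercases first
```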
answered Oct 21 '22 by Andrei Stefan


In Elasticsearch, there is also the Completion Suggester, which is designed specifically for giving autocomplete suggestions: Completion Suggester

answered Oct 21 '22 by chengpohi