Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Word-oriented completion suggester (ElasticSearch 5.x)

ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). Most notable change is the following:

Completion suggester is document-oriented

Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.

In short, all completion queries return all matching documents instead of just matched words. And herein lies the problem - duplication of autocompleted words if they occur in more than one document.

Let's say we have this simple mapping:

{
   "my-index": {
      "mappings": {
         "users": {
            "properties": {
               "firstName": {
                  "type": "text"
               },
               "lastName": {
                  "type": "text"
               },
               "suggest": {
                  "type": "completion",
                  "analyzer": "simple"
               }
            }
         }
      }
   }
}

With a few test documents:

{
   "_index": "my-index",
   "_type": "users",
   "_id": "1",
   "_source": {
      "firstName": "John",
      "lastName": "Doe",
      "suggest": [
         {
            "input": [
               "John",
               "Doe"
            ]
         }
      ]
   }
},
{
   "_index": "my-index",
   "_type": "users",
   "_id": "2",
   "_source": {
      "firstName": "John",
      "lastName": "Smith",
      "suggest": [
         {
            "input": [
               "John",
               "Smith"
            ]
         }
      ]
   }
}

And a by-the-book query:

POST /my-index/_suggest?pretty
{
    "my-suggest" : {
        "text" : "joh",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Which yields the following results:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "1",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Doe",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Doe"
                       ]
                    }
                 ]
               }
            },
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "2",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Smith",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Smith"
                       ]
                    }
                 ]
               }
            }
         ]
      }
   ]
}

In short, for a completion suggest for text "joh", two (2) documents were returned - both John's and both had the same value of the text property.

However, I would like to receive one (1) word. Something simple like this:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
          "John"
         ]
      }
   ]
}

Question: how to implement a word-based completion suggester. There is no need to return any document related data, since I don't need it at this point.

Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?


EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:

  1. Keeping the new index in sync.
  2. Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".

To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.

like image 285
alesc Avatar asked Jan 19 '17 14:01

alesc


3 Answers

As hinted at in the comment, another way of achieving this without getting the duplicate documents is to create a sub-field for the firstname field containing ngrams of the field. First you define your mapping like this:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}

Then you index a few documents:

POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }

Then you can query for joh and get one result for John and another one for Johnny

{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}

Results:

{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}

UPDATE (June 25th, 2019):

ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html

like image 55
Val Avatar answered Nov 07 '22 05:11

Val


An additional field skip_duplicates will be added in the next release 6.x.

From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates:

POST music/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "nor",
            "completion" : {
                "field" : "suggest",
                "skip_duplicates": true
            }
        }
    }
}
like image 38
Dries Cleymans Avatar answered Nov 07 '22 05:11

Dries Cleymans


We face exactly the same problem. In Elasticsearch 2.4 the approach like you describe used to work fine for us but now as you say the suggester has become document-based while like you we are only interested in unique words, not in the documents.

The only 'solution' we could think of so far is to create a separate index just for the words on which we want to perform the suggestion queries and in this separate index make sure somehow that identical words are only indexed once. Then you could perform the suggestion queries on this separate index. This is far from ideal, if only because we will then need to make sure that this index remains in sync with the other index that we need for our other queries.

like image 31
Edgar Vonk Avatar answered Nov 07 '22 06:11

Edgar Vonk