How to match multiple words as token prefixes

I'd like to take a query like "jan do" and have it match values like "jane doe", "don janek" -- and of course: "jan do", "do jan".

So the rules I can think of at the moment are:

  1. tokenize the query based on non-alphanumeric values (e.g. whitespace, symbols, punctuation)
  2. each query token acts as a prefix for matching tokens in the data store
  3. the order in which the tokens appear does not matter, though it would be nice to prefer "jan do" over "do jan"
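Step 1 is roughly what the standard tokenizer already does. As a sanity check (assuming a running cluster; the exact _analyze syntax varies a little between ES versions), you can see how a query string would be split:

GET /_analyze?tokenizer=standard&text=jan do

This should return two tokens, "jan" and "do", which could then each be used as a prefix.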

So far, I have this mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}

I've been searching things as phrases:

POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "jan do",
              "type": "phrase_prefix"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "jan do",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

And that works okay when the text actually starts with that phrase, but now I want to tokenize the query and treat each token as a prefix.

Is there a way I can do this (probably at query time)?

My other option is to just construct a query like this:

POST test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "answer.text.stemmed": "jan"
          }
        },
        {
          "prefix": {
            "answer.text.stemmed": "do"
          }
        }
      ]
    }
  }
}

This seems to work, but it doesn't preserve the order of the words. Also, I feel like that's cheating and possibly not the most performant option. What if there were 10 prefixes? 100? I'd like to know whether anyone feels otherwise.
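As an aside, instead of building the bool query by hand, a query_string query can express the same per-token prefixes more compactly (a sketch, not a performance fix, since the underlying prefix expansion is the same):

POST /test/_search
{
  "query": {
    "query_string": {
      "default_field": "answer.text.stemmed",
      "query": "jan* do*",
      "default_operator": "AND"
    }
  }
}

This avoids constructing one clause per token, but it still expands each wildcard against the term dictionary, so it shares the same scaling concern.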

asked Feb 09 '23 by user5243421
1 Answer

As the comment above suggests, you should take a look at ngrams in Elasticsearch, and in particular edge ngrams.

I wrote up an introduction to using ngrams in this blog post for Qbox, but here is a quick example you can play with.

Here is an index definition with a custom analyzer (using the standard tokenizer) that applies an edge ngram token filter along with several other filters.

There have been some changes in the way analyzers are applied in ES 2.0, but notice that I am using the standard analyzer as the "search_analyzer". This is because I don't want the search text to be tokenized into ngrams; I want it to be matched directly against the indexed tokens. I'll refer you to the blog post for the details.

Anyway, here is the mapping:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "autocomplete": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "standard",
                  "stop",
                  "kstem",
                  "edgengram_filter"
               ]
            }
         },
         "filter": {
            "edgengram_filter": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 15
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "autocomplete",
               "search_analyzer": "standard"
            },
            "price":{
                "type": "integer"
            }
         }
      }
   }
}
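To see concretely why the "search_analyzer" matters, you can run the index analyzer against a sample term (hypothetical check; _analyze syntax varies slightly by ES version). At index time, "shoes" is stemmed by kstem and then broken into edge ngram prefixes such as "sh" and "sho", while the standard search analyzer leaves the query term whole:

GET /test_index/_analyze?analyzer=autocomplete&text=shoes

If the search side also produced ngrams, a query like "sh" would expand into further sub-tokens and match far too broadly.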

Then I index a few simple documents:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"name": "very cool shoes","price": 26}
{"index":{"_id":2}}
{"name": "great shampoo","price": 15}
{"index":{"_id":3}}
{"name": "shirt","price": 25}

And now the following query will get me the expected autocomplete results:

POST /test_index/_search
{
   "query": {
      "match": {
         "name": {
            "query": "ver sh",
            "operator": "and"
         }
      }
   }
}
...
{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.2169777,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.2169777,
            "_source": {
               "name": "very cool shoes",
               "price": 26
            }
         }
      ]
   }
}
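The question also asked about preferring matches that keep the query's word order (rule 3). The match query above ignores order entirely; one hedged option is to add an optional match_phrase clause in a bool query, so that in-order hits score higher without filtering anything out (a sketch against the same index, with an illustrative slop value):

POST /test_index/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "name": {
                  "query": "ver sh",
                  "operator": "and"
               }
            }
         },
         "should": {
            "match_phrase": {
               "name": {
                  "query": "ver sh",
                  "slop": 2
               }
            }
         }
      }
   }
}

The should clause only contributes extra score when the tokens appear near each other in order, so "jan do"-style matches would rank above "do jan"-style ones.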

Here is all the code I used in the example:

http://sense.qbox.io/gist/c2ba05900d0749fa3b1ba516c66431ae1a9d5e61

answered Feb 12 '23 by Sloan Ahrens