How to match multiple words as token prefixes

I'd like to take a query like "jan do" and have it match values like "jane doe", "don janek" -- and of course: "jan do", "do jan".

So the rules I can think of at the moment are:

  1. tokenize the query based on non-alphanumeric values (e.g. whitespace, symbols, punctuation)
  2. each query token acts as a prefix for matching tokens in the data store
  3. the order in which the tokens appear does not matter, though it would be nice to prefer "jan do" over "do jan"
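Step 1 is roughly what the standard tokenizer already does. As a sanity check (assuming a running cluster; the exact _analyze syntax varies a little between ES versions), you can see how a query string would be split:

GET /_analyze?tokenizer=standard&text=jan do

This should return two tokens, "jan" and "do", which could then each be used as a prefix.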

So far, I have this mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}

I've been searching things as phrases:

POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "jan do",
              "type": "phrase_prefix"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "jan do",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

And that works okay when the text actually starts with that phrase, but now I want to tokenize the query and treat each token as a prefix.

Is there a way I can do this (probably at query time)?

My other option is to just construct a query like this:

POST test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "answer.text.stemmed": "jan"
          }
        },
        {
          "prefix": {
            "answer.text.stemmed": "do"
          }
        }
      ]
    }
  }
}

This seems to work, but it doesn't preserve the order of the words. Also, I feel like that's cheating and possibly not the most performant option. What if there were 10 prefixes? 100? I'd like to know whether anyone feels otherwise.
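As an aside, instead of building the bool query by hand, a query_string query can express the same per-token prefixes more compactly (a sketch, not a performance fix, since the underlying prefix expansion is the same):

POST /test/_search
{
  "query": {
    "query_string": {
      "default_field": "answer.text.stemmed",
      "query": "jan* do*",
      "default_operator": "AND"
    }
  }
}

This avoids constructing one clause per token, but it still expands each wildcard against the term dictionary, so it shares the same scaling concern.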

asked Feb 09 '23 by user5243421
1 Answer

As the comment above suggests, you should take a look at ngrams in Elasticsearch, and in particular edge ngrams.

I wrote up an introduction to using ngrams in this blog post for Qbox, but here is a quick example you can play with.

Here is an index definition with a custom analyzer (using the standard tokenizer) that applies an edge ngram token filter along with several other filters.

There have been some changes in the way analyzers are applied in ES 2.0, but notice that I am using the standard analyzer as the "search_analyzer". This is because I don't want the search text to be tokenized into ngrams; I want it to be matched directly against the indexed tokens. I'll refer you to the blog post for the details.

Anyway, here is the mapping:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "autocomplete": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "standard",
                  "stop",
                  "kstem",
                  "edgengram_filter"
               ]
            }
         },
         "filter": {
            "edgengram_filter": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 15
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "autocomplete",
               "search_analyzer": "standard"
            },
            "price":{
                "type": "integer"
            }
         }
      }
   }
}
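To see concretely why the "search_analyzer" matters, you can run the index analyzer against a sample term (hypothetical check; _analyze syntax varies slightly by ES version). At index time, "shoes" is stemmed by kstem and then broken into edge ngram prefixes such as "sh" and "sho", while the standard search analyzer leaves the query term whole:

GET /test_index/_analyze?analyzer=autocomplete&text=shoes

If the search side also produced ngrams, a query like "sh" would expand into further sub-tokens and match far too broadly.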

Then I index a few simple documents:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"name": "very cool shoes","price": 26}
{"index":{"_id":2}}
{"name": "great shampoo","price": 15}
{"index":{"_id":3}}
{"name": "shirt","price": 25}

And now the following query will get me the expected autocomplete results:

POST /test_index/_search
{
   "query": {
      "match": {
         "name": {
            "query": "ver sh",
            "operator": "and"
         }
      }
   }
}
...
{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.2169777,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.2169777,
            "_source": {
               "name": "very cool shoes",
               "price": 26
            }
         }
      ]
   }
}
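The question also asked about preferring matches that keep the query's word order (rule 3). The match query above ignores order entirely; one hedged option is to add an optional match_phrase clause in a bool query, so that in-order hits score higher without filtering anything out (a sketch against the same index, with an illustrative slop value):

POST /test_index/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "name": {
                  "query": "ver sh",
                  "operator": "and"
               }
            }
         },
         "should": {
            "match_phrase": {
               "name": {
                  "query": "ver sh",
                  "slop": 2
               }
            }
         }
      }
   }
}

The should clause only contributes extra score when the tokens appear near each other in order, so "jan do"-style matches would rank above "do jan"-style ones.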

Here is all the code I used in the example:

http://sense.qbox.io/gist/c2ba05900d0749fa3b1ba516c66431ae1a9d5e61

answered Feb 12 '23 by Sloan Ahrens