ElasticSearch use "best match" of ngram terms instead of "synonym"?

Is it possible to tell ElasticSearch to use "best match" of all grams instead of using grams as synonyms?

By default ElasticSearch uses grams as synonyms and returns poorly matching documents. An example shows it best. Say we have two people in the index:

alice wang
sarah kerry

We search for ali12345:

{
  query: {
    bool: {
      should: {
        match: { name: 'ali12345' }
      }
    }
  }
}

and it will return alice wang.

How is that possible? Because by default ElasticSearch treats the grams as synonyms, so a document matches even if only one gram matches.

If you inspect the query, you'll see that it treats the grams as synonyms:

...
"explanation": {
  "value": 5.274891,
  "description": "weight(Synonym(name: ali name:li1 name:i12 name:123 name:234 name:345 ) in 0) [PerFieldSimilarity], result of:",
...

I wonder if it's possible to tell it to use a "best match" query instead, to achieve something like:

{
  query: {
    bool: {
      should: [
        { term: { name: 'ali' }},
        { term: { name: 'li1' }},
        { term: { name: 'i12' }},
        { term: { name: '123' }},
        { term: { name: '234' }},
        { term: { name: '345' }}
      ],
      minimum_should_match: '75%'
    }
  }
}
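
For illustration, the hand-built query above could be assembled on the client roughly like this. It is only a sketch in Python: it assumes the analysis amounts to slicing the raw string into trigrams (which is what a keyword tokenizer followed by a 3-gram filter does), it targets the name field from the index configuration, and the helper names trigrams and best_match_query are made up for this example.

```python
def trigrams(text, n=3):
    """Return every contiguous n-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def best_match_query(field, text, minimum_should_match="75%"):
    """Build a bool query with one term clause per trigram of text."""
    # term queries are not analyzed, so the grams must be produced here
    clauses = [{"term": {field: gram}} for gram in trigrams(text)]
    return {
        "query": {
            "bool": {
                "should": clauses,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


query = best_match_query("name", "ali12345")
print(len(query["query"]["bool"]["should"]), "term clauses")
```

Any extra step in the real analyzer (lowercasing, folding and so on) would have to be replicated by hand in trigrams, which is exactly the drawback raised in question 1 below.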

Questions:

  1. It's of course possible to generate this query manually, but then you have to apply the ngram parsing and the rest of the analyzer pipeline manually as well. So I wonder if ElasticSearch could do it itself?

  2. What would the performance of such a query be for a long string, when there are tens of grams/terms? Will it use smart optimisations, as it does when searching for similar documents (see more_like_this), where it uses only the terms with the highest tf-idf rather than all of them?

P.S.

The index configuration:

{
  mappings: {
    object: {
      properties: {
        name: {
          type:     'text',
          analyzer: 'trigram_analyzer'
        }
      }
    }
  },

  settings: {
    analysis: {
      filter: {
        trigram_filter: { type: 'ngram', min_gram: 3, max_gram: 3 }
      },
      analyzer: {
        trigram_analyzer: {
          type:        'custom',
          tokenizer:   'keyword',
          filter:      [ 'trigram_filter' ]
        }
      }
    }
  }
}
Alex Craft asked Dec 09 '17 13:12


2 Answers

I know this question is old, but just in case...

You should be able to use the minimum_should_match parameter (minimumShouldMatch in the Java client) on the trigram query to specify how many trigrams must match for a record to be considered a hit. You could use something like "3<75%", which means: if there are 3 or fewer trigrams, then 100% must match; if there are 4 or more trigrams, then 75% must match.
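
To make the "3<75%" rule concrete, here is a small Python sketch of the arithmetic. It is not Elasticsearch's own code: required_matches is a hypothetical helper that only covers this one conditional form, and it relies on the documented behaviour that a percentage is rounded down. The query at the end shows where the setting would go on a match query against the name field from the question.

```python
def required_matches(clause_count, threshold=3, percent=75):
    """Optional clauses that must match under a "threshold<percent%" spec."""
    if clause_count <= threshold:
        # At or below the threshold every clause is required.
        return clause_count
    # Above the threshold the percentage applies, rounded down.
    return clause_count * percent // 100


for count in (2, 3, 4, 6):
    print(count, "trigrams ->", required_matches(count), "must match")

# Where the spec would be attached on a match query:
query = {
    "query": {
        "match": {
            "name": {"query": "ali12345", "minimum_should_match": "3<75%"}
        }
    }
}
```

For the six trigrams of ali12345 this gives four required matches. Note that the setting counts optional clauses, so it can only help if the grams reach the query as separate clauses; the explain output in the question shows them grouped into one Synonym clause, in which case there may be only a single clause to count.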

Mario Köhler answered Oct 30 '22 14:10


Perhaps you have already found the reason, but ali12345 matches alice wang because the analyzer used at search time is the same one used at index time, ngrams included.

That is:

At index time: for text alice wang, these terms are created [ali, lic, ice, ...]

At search time: for text ali12345, these terms are created [ali, li1, i12, ...]

As we can see, both lists contain the term ali, so there is a match.
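
The overlap can be reproduced with a few lines of Python. This is a sketch that assumes the analyzer does nothing more than slice the whole input into 3-character grams, as the keyword tokenizer plus trigram_filter in the question would; the trigrams helper is made up for illustration.

```python
def trigrams(text, n=3):
    """Return every contiguous n-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


indexed = trigrams("alice wang")   # terms produced at index time
searched = trigrams("ali12345")    # terms produced at search time

# The only term the two lists have in common
shared = set(indexed) & set(searched)
print(shared)
```

Because the keyword tokenizer keeps the whole string as one token, the indexed grams also include ones that span the space, such as "e w".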

To avoid this problem, ElasticSearch lets you specify a different analyzer for search time. In the mapping for the field name you can add another property, search_analyzer, which is normally very similar to the main analyzer but without the ngram token filter. This would prevent [ali, li1, i12] from being generated during search analysis, resulting in 0 matches for alice wang.
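
As a rough sketch, the index configuration from the question with a search_analyzer added might look like the Python dict below. The choice of the built-in keyword analyzer for search time is an assumption made for this example (it mirrors the keyword tokenizer without the trigram filter), and the mapping keeps the question's own layout, including its object mapping type.

```python
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "trigram_filter": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["trigram_filter"],
                }
            },
        }
    },
    "mappings": {
        "object": {
            "properties": {
                "name": {
                    "type": "text",
                    # used when documents are indexed
                    "analyzer": "trigram_analyzer",
                    # assumed choice: no ngram step when the query is analyzed
                    "search_analyzer": "keyword",
                }
            }
        }
    },
}

name_field = index_body["mappings"]["object"]["properties"]["name"]
print(name_field["analyzer"], "/", name_field["search_analyzer"])
```

With this setup the search text ali12345 stays a single term, which is not among the trigrams indexed for alice wang.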

Feel free to look into more details and explanations on this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html

mrd3650 answered Oct 30 '22 12:10