
Elasticsearch multi field fuzzy search not returning exact match first

I am performing a fuzzy elasticsearch query on 'text' and 'keywords' fields. I have two documents in elasticsearch, one with 'text' "testPhone 5" and the other "testPhone 4s". When I perform a fuzzy query with "testPhone 5", I am seeing that both documents are being given the exact same score value. Why is this occurring?

Extra info: I am indexing documents using the 'uax_url_email' tokenizer and 'lowercase' filter.

This is the query I am making:

{
    query : {
        bool: {
            // match one or the other fuzzy query
            should: [
                {
                    fuzzy: {
                        text: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 5,
                        }
                    }
                },
                {
                    fuzzy: {
                        keywords: {
                            min_similarity: 0.4,
                            value: 'testphone 5',
                            prefix_length: 0,
                            boost: 1,
                        }
                    }
                }
            ]
        }
    },
    sort: [ 
        '_score'
    ],
    explain: true
}

This is the result:

{ max_score: 0.47213298,
  total: 2,
  hits:
  [ { _index: 'test',
     _shard: 0,
     _id: '51fbf95f82e89ae8c300002c',
     _node: '0Mtfzbe1RDinU71Ordx-Ag',
     _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002c',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 5',
      keywords: [ [length]: 0 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 { _index: 'test',
   _shard: 4,
   _id: '51fbf95f82e89ae8c300002d',
   _node: '0Mtfzbe1RDinU71Ordx-Ag',
   _source:
    { next: { id: '51fbf95f82e89ae8c3000027' },
      cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
      other: false,
      _id: '51fbf95f82e89ae8c300002d',
      category: '51fbf95f82e89ae8c300002b',
      image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
      text: 'testPhone 4s',
      keywords: [ 'apple', [length]: 1 ],
      __v: 0 },
   _type: 'productgroup',
   _explanation:
    { details:
       [ { details:
            [ { details:
                 [ { details:
                      [ { details:
                           [ { value: 3.8888888, description: 'boost' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.17020021,
                               description: 'queryNorm' },
                             [length]: 3 ],
                          value: 0.99999994,
                          description: 'queryWeight, product of:' },
                        { details:
                           [ { details:
                                [ { value: 1, description: 'termFreq=1.0' },
                                  [length]: 1 ],
                               value: 1,
                               description: 'tf(freq=1.0), with freq of:' },
                             { value: 1.5108256,
                               description: 'idf(docFreq=2, maxDocs=5)' },
                             { value: 0.625,
                               description: 'fieldNorm(doc=0)' },
                             [length]: 3 ],
                          value: 0.944266,
                          description: 'fieldWeight in 0, product of:' },
                        [length]: 2 ],
                     value: 0.94426596,
                     description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
                   [length]: 1 ],
                value: 0.94426596,
                description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
              [length]: 1 ],
           value: 0.94426596,
           description: 'sum of:' },
         { value: 0.5, description: 'coord(1/2)' },
         [length]: 2 ],
      value: 0.47213298,
      description: 'product of:' },
   _score: 0.47213298 },
 [length]: 2 ] }
tez asked Oct 21 '22 03:10


2 Answers

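Fuzzy queries are not analyzed, but the field is. Your search value testphone 5 is therefore compared as a single term against the analyzed terms in the index, and with a min_similarity of 0.4 it matches the term testphone, which both documents contain. Both documents are then scored on that same term, as the explain output shows: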

description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
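The boost of 3.8888888 in that explain line can be reproduced with a small sketch. It assumes the similarity is computed as 1 - editDistance / length of the shorter string, and that the indexed terms for the two documents are testphone, 5 and 4s; both are assumptions, not something the explain output states directly. The helper names are hypothetical.

```javascript
// Plain Levenshtein edit distance between two strings.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed similarity formula: 1 - distance / length of the shorter string.
function similarity(queryTerm, indexedTerm) {
  const shorter = Math.min(queryTerm.length, indexedTerm.length);
  return 1 - editDistance(queryTerm, indexedTerm) / shorter;
}

const queryTerm = 'testphone 5';               // sent as one unanalyzed term
const indexedTerms = ['testphone', '5', '4s']; // assumed output of analysis for both documents
const minSimilarity = 0.4;
const boost = 5;

// Keep only the indexed terms that are similar enough to the query term.
const matches = indexedTerms
  .map((term) => ({ term, similarity: similarity(queryTerm, term) }))
  .filter((m) => m.similarity >= minSimilarity);

for (const m of matches) {
  // Prints: testphone 0.7778 3.8889
  console.log(m.term, m.similarity.toFixed(4), (boost * m.similarity).toFixed(4));
}
```

Under these assumptions only testphone survives the similarity cut-off, and 5 × 7/9 ≈ 3.8889 agrees with the boost shown for both documents, so neither document gets credit for the 5.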

See also @imotov's excellent answer here: ElasticSearch's Fuzzy Query

You can see exactly how a string will be tokenized using the _analyze API:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

For example:

http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5

will return:

{
   "tokens": [
      {
         "token": "testphone",
         "start_offset": 0,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "5",
         "start_offset": 10,
         "end_offset": 11,
         "type": "<NUM>",
         "position": 2
      }
   ]
}
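A minimal sketch of making the same request from JavaScript. It assumes a cluster at localhost:9200, the prefix_test index from the URL above, and a runtime with a global fetch; the function names are hypothetical. Newer Elasticsearch versions may expect a JSON request body instead of query parameters.

```javascript
// Build the _analyze URL in the query-parameter style used above.
// URLSearchParams encodes the space in the text as '+'.
function buildAnalyzeUrl(baseUrl, index, field, text) {
  const params = new URLSearchParams({ field, text });
  return `${baseUrl}/${index}/_analyze?${params.toString()}`;
}

// Fetch the analysis result and return just the token strings.
// Needs a reachable cluster, so it is defined here but not called.
async function analyze(baseUrl, index, field, text) {
  const response = await fetch(buildAnalyzeUrl(baseUrl, index, field, text));
  const body = await response.json();
  return body.tokens.map((t) => t.token);
}

const url = buildAnalyzeUrl('http://localhost:9200', 'prefix_test', 'text', 'testphone 5');
console.log(url);
```

Calling analyze with the same arguments against the index above should yield the tokens testphone and 5, matching the response shown.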

So even if you index the value testphone sammsung, a fuzzy query for "testphone samsunk" won't yield anything, whereas just samsunk will.

You may get better results by not analyzing the field (or by using the keyword analyzer).

If you want to have different analysis on a single field you can use the multi_field construct.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
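As a rough sketch, a multi_field mapping for the text field could look like the following. The productgroup type name comes from the question's results; the untouched sub-field name is hypothetical, and the syntax is the legacy multi_field form, which later Elasticsearch versions replaced with a fields option on regular field types.

```javascript
// Legacy multi_field mapping: one analyzed version and one
// not_analyzed version of the same source field.
const mapping = {
  productgroup: {
    properties: {
      text: {
        type: 'multi_field',
        fields: {
          text: { type: 'string', index: 'analyzed' },
          untouched: { type: 'string', index: 'not_analyzed' } // hypothetical sub-field name
        }
      }
    }
  }
};

// A fuzzy query against the not_analyzed sub-field compares the whole value.
// Note that not_analyzed also skips lowercasing, so case counts as a difference.
const fuzzyOnWholeValue = {
  query: {
    fuzzy: {
      'text.untouched': { value: 'testPhone 5', min_similarity: 0.4 }
    }
  }
};

console.log(JSON.stringify(mapping, null, 2));
```

With the whole value indexed as one term, "testPhone 5" is an exact match for the first document and one edit away from "testPhone 4s" only in part, so the two documents no longer collapse onto the same term.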

Martijn Laarman answered Oct 30 '22 21:10


I ran into this issue myself recently. I can't tell you exactly why it is happening, but I CAN tell you how I fixed it:

I ran 2 queries over the same field, one with an exact match, and then the exact same query on the same field with fuzzy matches enabled and a lower boost.

That made sure that my exact matches always ended up higher than the fuzzy matches.
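A sketch of that approach, assuming match queries are used for both clauses and that the fuzzy clause is enabled through the match query's fuzziness option. The function name, the default boosts, and the default fuzziness value are illustrative assumptions, not the exact query I ran.

```javascript
// Build a bool query with two should clauses on the same field:
// a plain match with a high boost and a fuzzy match with a low boost.
function buildExactThenFuzzy(field, value, options = {}) {
  const { exactBoost = 5, fuzzyBoost = 1, fuzziness = 1 } = options;
  return {
    query: {
      bool: {
        should: [
          { match: { [field]: { query: value, boost: exactBoost } } },
          { match: { [field]: { query: value, fuzziness, boost: fuzzyBoost } } }
        ]
      }
    }
  };
}

const body = buildExactThenFuzzy('text', 'testphone 5');
console.log(JSON.stringify(body, null, 2));
```

A document that satisfies the first clause also satisfies the second, so it collects both contributions, while a fuzzy-only hit gets just the lower-boosted one.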

P.S. I think they're scored equally because, with fuzziness enabled, they both match, and ES doesn't care that one is an exact match as long as they both match. This is pure theorycrafting on my end, though, since I'm not intimately familiar with the scoring algorithm.

Constantijn Visinescu answered Oct 30 '22 21:10