
Correct sorting for exact matches and "beginning with" (prefix) in Elasticsearch

I need to improve the result list on search with Elasticsearch.

Let's say we have 3 documents with a single field and content like this:

  • "apple"
  • "green apple"
  • "apple tree"

If I search for "apple", I may get the results sorted like this:

  • "green apple"
  • "apple tree"
  • "apple"

But I want the exact match to have the highest score; here, that is the document "apple".

The next highest score should go to the entries beginning with the search word (here "apple tree"), and the rest should be sorted the default way.

So I want the results in this order:

  • "apple"
  • "apple tree"
  • "green apple"

I have tried to achieve it by using rescore:

curl -X GET "http://localhost:9200/my_index_name/_search?size=10&pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "query_string": {
          "query": "apple"
      }
   },
   "rescore": {
      "window_size": 500,
      "query": {
         "score_mode": "multiply",
         "rescore_query": {
            "bool": {
               "should": [
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple",
                           "boost": 4
                        }
                     }
                  },
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple*",
                           "boost": 2
                        }
                     }
                  }
               ]
            }
         },
         "query_weight": 0.7,
         "rescore_query_weight": 1.2
      }
   }
}'

But this does not really work, because Elasticsearch seems to split all words on whitespace. For example, a search for "apple*" also returns "green apple". That seems to be the reason why rescore is not working for me.

Possibly there are other characters, such as ".", "-" or ";", that Elasticsearch also splits on and that mess up my sorting.

I also played around with "match_phrase" in "rescore_query" instead of "bool", but without success.

I have also tried it with only one match clause, like this:

curl -X GET "http://localhost:9200/my_index_name/_search?size=10&pretty" -H 'Content-Type: application/json' -d'
{
   "query": {
      "query_string": {
          "query": "apple"
      }
   },
   "rescore": {
      "window_size": 500,
      "query": {
         "score_mode": "multiply",
         "rescore_query": {
            "bool": {
               "should": [
                  {
                     "match": {
                        "my_field1": {
                           "query": "apple*",
                           "boost": 2
                        }
                     }
                  }
               ]
            }
         },
         "query_weight": 0.7,
         "rescore_query_weight": 1.2
      }
   }
}'

And it seems to work, but I am still not sure. Would this be the correct way to do it?

EDIT1: With other queries, the one-match rescore does not work correctly.

Andreas L. asked Mar 04 '23 14:03


1 Answer

The only place where you need to manipulate the score is the exact match; otherwise, the order by position of terms gives you the correct order. Let's walk through this:

Let's first create a mapping as below:

PUT test
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field1": {
          "type": "text",
          "analyzer": "whitespace",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

I have created the field my_field1 with the whitespace analyzer to make sure tokens are created using whitespace as the only delimiter. Secondly, I have created a subfield named keyword of type keyword. The keyword subfield holds the non-analyzed value of the input string, and we'll use it for the exact match.
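To make the difference between the two fields concrete, here is a small Python sketch. It is not Elasticsearch code: the helpers whitespace_tokens and keyword_token are made-up names that only imitate, roughly, what the whitespace analyzer and a keyword field do with an input string.

```python
# Rough imitation of how the two fields index a value.
# NOT Elasticsearch code; both helper names are hypothetical.

def whitespace_tokens(value):
    """Imitate the whitespace analyzer: split on whitespace only, no lowercasing."""
    return value.split()

def keyword_token(value):
    """Imitate a keyword field: the whole string is kept as one single term."""
    return [value]

docs = {1: "apple", 2: "apple tree", 3: "green apple"}

for doc_id, text in docs.items():
    print(doc_id, whitespace_tokens(text), keyword_token(text))

# Which docs contain the term "apple" in each field?
analyzed_hits = [d for d, t in docs.items() if "apple" in whitespace_tokens(t)]
keyword_hits = [d for d, t in docs.items() if "apple" in keyword_token(t)]
print(analyzed_hits, keyword_hits)
```

In this imitation, every document contains the term apple in the analyzed field, while only doc 1 has apple as its complete value in the keyword subfield.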

Let's add a few docs to the index:

PUT test/_doc/1
{
  "my_field1": "apple"
}

PUT test/_doc/2
{
  "my_field1": "apple tree"
}

PUT test/_doc/3
{
  "my_field1": "green apple"
}

If we use the query below to search for the term apple, the order of docs will be 2, 1, 3.

POST test/_doc/_search
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "apple",
      "fields": [
        "my_field1"
      ]
    }
  }
}

"explain": true in the above query returns the score calculation steps in the output. Reading them gives you insight into how a document is scored.

All we need to do is boost the score for the exact match. We'll run the exact match against the field my_field1.keyword. You might ask why not against my_field1. The reason is that my_field1 is analyzed: when tokens are generated for the input strings of the 3 docs, all of them get the token (term) apple stored against this field (along with other terms if present, e.g. tree for doc 2 and green for doc 3). When we run an exact match on this field for the term apple, all docs match, the effect on each document's score is similar, and the order does not change.

Since only one document (doc 1) has the exact value apple in my_field1.keyword, only that document matches the exact query, and we boost it. So the query will be:

POST test/_doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "apple",
            "fields": [
              "my_field1"
            ]
          }
        },
        {
          "query_string": {
            "query": "\"apple\"",
            "fields": [
              "my_field1.keyword^2"
            ]
          }
        }
      ]
    }
  }
}
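If the search term comes from user input, the same request body can be built in code. The following is only a sketch: the function name build_exact_boost_query and its parameters are made up for illustration, and it simply produces the same JSON structure as the query above for an arbitrary term.

```python
import json

def build_exact_boost_query(term, field="my_field1", exact_boost=2):
    """Build the bool/should body shown above for an arbitrary search term.

    Hypothetical helper, not part of any Elasticsearch client.
    """
    # Escape backslashes and double quotes so the term stays one quoted phrase
    # in the second clause.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "query": {
            "bool": {
                "should": [
                    # Normal search on the analyzed field. The raw term is still
                    # parsed by query_string, so operators in it keep their meaning.
                    {"query_string": {"query": term, "fields": [field]}},
                    # Exact match on the keyword subfield, boosted.
                    {
                        "query_string": {
                            "query": f'"{escaped}"',
                            "fields": [f"{field}.keyword^{exact_boost}"],
                        }
                    },
                ]
            }
        }
    }

body = build_exact_boost_query("apple")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body of the search; the boost value 2 is taken from the query above and may need tuning for other data.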

Output for the above query:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.7260925,
    "hits": [
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.7260925,
        "_source": {
          "my_field1": "apple"
        }
      },
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "my_field1": "apple tree"
        }
      },
      {
        "_index": "test3",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "green apple"
        }
      }
    ]
  }
}
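One caveat: the default BM25 scoring does not look at where a term sits inside the field, so the order of "apple tree" and "green apple" may differ on other data or with another shard layout. If the "beginning with" tier has to be explicit, one possible variation (not the query used above) is to add a prefix clause on the keyword subfield. The sketch below assumes the same mapping; the function name build_tiered_query and the boost values 4 and 2 are made up and would need tuning.

```python
import json

def build_tiered_query(term, field="my_field1", exact_boost=4, prefix_boost=2):
    """Sketch of a three-tier query: exact match, then prefix, then the rest.

    Hypothetical helper; the boost values are assumptions, not tested settings.
    """
    keyword_field = f"{field}.keyword"
    return {
        "query": {
            "bool": {
                # Base relevance: every hit must contain the term.
                "must": [{"match": {field: term}}],
                "should": [
                    # Tier 1: the whole field value equals the term.
                    {"term": {keyword_field: {"value": term, "boost": exact_boost}}},
                    # Tier 2: the field value starts with the term
                    # (an exact match also counts as a prefix match).
                    {"prefix": {keyword_field: {"value": term, "boost": prefix_boost}}},
                ],
            }
        }
    }

tiered = build_tiered_query("apple")
print(json.dumps(tiered, indent=2))
```

Both the term and the prefix clause run against the non-analyzed value, so they are case-sensitive. Whether the boosts are large enough to outweigh differences in the base score depends on the data.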
Nishant answered Mar 07 '23 03:03