
ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. Does it use Lucene's scoring, or does it use scoring of its own?

For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Consider this query:

IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s => s
    .From(0)
    .Size(300)
    .Explain()
    .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);

which is translated to the following JSON query:

{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}

The search is performed on about 1.1 million documents. What I get in return is (this is only part of the result, formatted by me):

650   "ExampleName" 7,313398

651   "ExampleName" 7,313398

652   "ExampleName" 7,313398

653   "ExampleName" 7,239194

654   "ExampleName" 7,239194

860   "ExampleName of Something" 4,5708737  

where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.

As you can see, there are many duplicates in the ES index. Since some of the found documents have different scores even though they are exactly the same (only the Id differs), I concluded that different shards searched different parts of the whole dataset. This leads me to suspect that the score is based partly on the overall data in a given shard, not exclusively on the document actually being considered by the search engine.

The question is: how exactly does this scoring work? Could you show me, or point me to, the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?

Przemysław Kalita asked Jul 08 '13 08:07



1 Answer

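The default scoring is the DefaultSimilarity algorithm in core Lucene, so ElasticSearch uses Lucene's scoring rather than one of its own. The formula is largely documented in the Lucene Javadoc for TFIDFSimilarity, which DefaultSimilarity extends. You can customize scoring by configuring your own Similarity, or by using something like a custom_score query.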
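For reference, the scoring function described in the TFIDFSimilarity Javadoc can be written roughly as below. The component definitions are those of DefaultSimilarity as I understand them for Lucene 4.x; the symbol names (overlap, maxOverlap, numTerms and so on) are my own shorthand, so check the Javadoc of your Lucene version before relying on the details.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)
\]
\[
\begin{aligned}
\mathrm{tf}(t,d) &= \sqrt{\mathrm{freq}(t,d)} \\
\mathrm{idf}(t) &= 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1} \\
\mathrm{coord}(q,d) &= \frac{\mathrm{overlap}(q,d)}{\mathrm{maxOverlap}(q)} \\
\mathrm{queryNorm}(q) &= \frac{1}{\sqrt{\mathrm{boost}(q)^{2}\cdot\sum_{t \in q}\bigl(\mathrm{idf}(t)\cdot\mathrm{boost}(t)\bigr)^{2}}} \\
\mathrm{norm}(t,d) &= \mathrm{fieldBoost}\cdot\frac{1}{\sqrt{\mathrm{numTerms}(d)}}
\quad\text{(stored lossily in one byte at index time)}
\end{aligned}
\]
```

For a single-term query with all boosts equal to 1, queryNorm reduces to 1/idf(t) and coord to 1, so the score collapses to tf(t,d) · idf(t) · norm(t,d). Note that numDocs and docFreq are taken from the shard that holds the document, not from the whole index.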

The score variation among the first five results is small enough that it doesn't concern me much, as far as the validity of the query results and their ordering goes. If you want to understand its cause, the explain API can show you exactly what is going on there.
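As an illustration of where such a variation can come from, here is a minimal Python sketch, assuming the classic DefaultSimilarity definitions (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))) and a single-term query with no boosts. The shard names and statistics are hypothetical and are not taken from the index in the question; the function names are my own.

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency as defined by Lucene's classic DefaultSimilarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def single_term_score(term_freq: int, num_docs: int, doc_freq: int,
                      field_norm: float = 1.0) -> float:
    """Score of one document for a one-term query with all boosts equal to 1.

    In that case queryNorm is 1/idf, so the query weight (idf * queryNorm)
    is 1 and the score equals the field weight: tf * idf * fieldNorm.
    """
    tf = math.sqrt(term_freq)
    idf = classic_idf(num_docs, doc_freq)
    return tf * idf * field_norm


# Hypothetical per-shard statistics: the same term is slightly more
# common on shard_b than on shard_a.
shard_stats = {
    "shard_a": {"num_docs": 220_000, "doc_freq": 400},
    "shard_b": {"num_docs": 220_000, "doc_freq": 430},
}

# The same document (term appears once, one-term field) scored on each shard.
scores = {
    name: single_term_score(term_freq=1, **stats)
    for name, stats in shard_stats.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.6f}")
```

Identical documents get different scores here only because docFreq differs between the two shards. In the explain output, the line to compare between hits is the idf entry, which in Lucene versions of that era reports the docFreq and maxDocs values that were used. If the difference matters for your ordering, search_type=dfs_query_then_fetch asks ElasticSearch to gather term statistics across shards before scoring, at the cost of an extra round trip.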

femtoRgon answered Oct 08 '22 04:10