 

Elasticsearch scoring on multiple indexes

I have one index for each quarter of a year ("index-2015.1", "index-2015.2", ...).

Each index holds around 30 million documents.

Each document has a text field ('title').

My results are sorted by (1) _score, then (2) created date.

The problem is:

When I search for some text in the 'title' field across all indexes ("index-201*"), the first results always come from a single index.

For example, if I search for 'title=home' and there are 10k documents with title=home in "index-2015.1" and 10k documents with title=home in "index-2015.2", then the first results are all from "index-2015.1" (none from "index-2015.2", and not mixed), even though "index-2015.2" contains documents with a later "created date" than those in "index-2015.1".

Is there a reason for this?

Eyal Ch asked Oct 30 '15 09:10



2 Answers

The reason is probably that scores are specific to each index. If you really have multiple indices, the score of a document is calculated (slightly) differently in each index.

Simply put, the score of a matching document depends, among other things, on the query terms and how often they occur in the index. The score is calculated relative to the index (by default, actually relative to each separate shard). Elasticsearch does apply some normalization, but I don't know the details.
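To make the point about index-local statistics concrete, here is a minimal sketch of an IDF calculation done per index and then over both indexes combined. The formula is a simplified classic TF/IDF-style IDF, not necessarily what your Elasticsearch version uses, and the document counts are made-up numbers for illustration only.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Simplified classic-style IDF: rarer terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for the term "home" in the 'title' field.
stats = {
    "index-2015.1": {"num_docs": 28_000_000, "doc_freq": 10_000},
    "index-2015.2": {"num_docs": 32_000_000, "doc_freq": 10_500},
}

# IDF computed separately from each index's own statistics.
local = {name: idf(s["num_docs"], s["doc_freq"]) for name, s in stats.items()}

# IDF computed once from the statistics of both indexes combined.
global_idf = idf(
    sum(s["num_docs"] for s in stats.values()),
    sum(s["doc_freq"] for s in stats.values()),
)

for name, value in local.items():
    print(f"{name}: local idf = {value:.4f}")
print(f"combined: global idf = {global_idf:.4f}")
```

Even with the same number of matching titles, a small difference in local statistics gives every hit from one index a slightly higher score than every hit from the other, so sorting by _score first groups the results by index.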

I can't explain it very well, but here is the article about scoring. You should read at least the part about TF/IDF, which I think explains why you get different scores.

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html


EDIT:

After testing a bit on my machine, it seems possible to use a different search_type to get scores that suit your case.

POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
    "query" : {
       "match": {
          "title": "home"
       }
    }
}

The important part is search_type=dfs_query_then_fetch. If you are programming in Java or something similar, there should be a way to specify it in the request. For details about the search types, refer to the documentation.

Basically, it first collects the term frequencies from all affected shards (and indexes), so the score is computed from the combined statistics and should be comparable across all of them.
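If you are calling the REST API from code, the request above could be built roughly like this. It is a sketch using only the Python standard library; the host `http://localhost:9200` and the helper name `build_search_request` are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, indices, text):
    """Build (but do not send) a POST /<indices>/_search request
    with search_type=dfs_query_then_fetch."""
    url = f"{base_url}/{','.join(indices)}/_search?search_type=dfs_query_then_fetch"
    body = {"query": {"match": {"title": text}}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local cluster address; replace with your own.
req = build_search_request("http://localhost:9200", ["index1", "index2"], "home")
print(req.get_method(), req.full_url)

# To actually run the search (requires a reachable cluster):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Client libraries generally expose the same setting as a search parameter, so check your client's documentation for the exact name.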

Slomo answered Oct 16 '22 11:10



Following the suggestions from Andrei Stefan and Slomo, index boosting solved my problem:

    body = {
        "indices_boost": {"index-2015.4": 1.4, "index-2015.3": 1.3, "index-2015.2": 1.2, "index-2015.1": 1.1}
    }
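For context, here is a sketch of how that fragment might sit inside a complete search body together with the query and the two-level sort from the question. The field name `created_date` and the helper `build_boosted_search` are assumptions, since the real field name is not given. The object form of `indices_boost` matches the snippet above; newer Elasticsearch versions may expect a list of single-entry objects instead, so check the documentation for your version.

```python
import json


def build_boosted_search(text, boosts, date_field="created_date"):
    """Return a search body that boosts hits per index and sorts
    by score first, then by creation date (newest first)."""
    return {
        "query": {"match": {"title": text}},
        "indices_boost": dict(boosts),
        # "_score" sorts descending by default; the date field is a
        # hypothetical name standing in for the real "created date" field.
        "sort": ["_score", {date_field: {"order": "desc"}}],
    }


boosts = {
    "index-2015.4": 1.4,
    "index-2015.3": 1.3,
    "index-2015.2": 1.2,
    "index-2015.1": 1.1,
}

body = build_boosted_search("home", boosts)
print(json.dumps(body, indent=2))
```

Note that boosting newer indexes makes them win by design, which is a different goal from making scores comparable across indexes.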

EDIT:

Using search_type=dfs_query_then_fetch (as Slomo described) solves the problem in a better way (depending on your business model).

Eyal Ch answered Oct 16 '22 11:10
