
Elasticsearch outputs the score of 1.0 for all results when searching for a single "starred" term

We are using Elasticsearch to search for the most relevant companies in a specific catalog. When we use a plain search term such as lettering, we get reasonable scores and can sort the results by score.

However, when we modify the search term before querying and turn it into its "starred" version (e.g., *lettering*) so that we can search for substrings, we get a score of 1.0 for every result. Substring search is a requirement in the project.

Any ideas on what could cause this relevance computation? The problem occurs only when a single term is used. We get comprehensible scores when we use two starred terms in combination (e.g., *lettering* *digital*).

EDIT 1:

Example mapping (YAML; the other properties are mapped in the same way, except for boost, which is different for each property):

    elasticSearchMapping:
      type: object
      include_in_all: true
      enabled: true
      properties:
        'keywords':
          type: string
          include_in_all: true
          boost: 50

Query:

{
"query": {
    "filtered": {
        "query": {
            "bool": {
                "must": [{
                    "match_all": []
                }, {
                    "query_string": {
                        "query": "*lettering*"
                    }
                }]
            }
        },
        "filter": {
            "bool": {
                "must": [{
                    "term": {
                        "__parentPath": "/sites/industrycatalog"
                    }
                }, {
                    "terms": {
                        "__workspace": ["live"]
                    }
                }, {
                    "term": {
                        "__dimensionCombinationHash": "d751713988987e9331980363e24189ce"
                    }
                }, {
                    "term": {
                        "__typeAndSupertypes": "IndustryCatalog:Entry"
                    }
                }],
                "should": [],
                "must_not": [{
                    "term": {
                        "_hidden": true
                    }
                }, {
                    "range": {
                        "_hiddenBeforeDateTime": {
                            "gt": "now"
                        }
                    }
                }, {
                    "range": {
                        "_hiddenAfterDateTime": {
                            "lt": "now"
                        }
                    }
                }]
            }
        }
    }
},
"fields": ["__path"],
"script_fields": {
    "distance": {
        "script": "doc['coordinates'].distanceInKm(51.75631079999999,14.332867899999997)"
    }
},
"sort": [{
    "customer.featureFlags.industrycatalog": {
        "order": "asc"
    }
}, {
    "_geo_distance": {
        "coordinates": {
            "lat": "51.75631079999999",
            "lon": "14.332867899999997"
        },
        "order": "asc",
        "unit": "km",
        "distance_type": "plane"
    }
}],
"size": 999999
}

asked Jan 08 '16 17:01 by cenetp




1 Answer

What you are running is a wildcard query. Wildcard queries are term-level queries, and by default they are rewritten to a constant-score query, so every matching document gets the same score.

See the Lucene documentation: WildcardQuery extends MultiTermQuery, which is where this rewrite behavior comes from.

You can also verify this with the explain API; you will see something like this:

"_explanation": {
     "value": 1,
     "description": "ConstantScore(company:lettering), product of:",
     "details": [{
         "value": 1,
         "description": "boost"
     }, {
         "value": 1,
         "description": "queryNorm"
     }]
 }
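
A minimal sketch of a search body that asks for such an explanation, assuming the query from the question reduced to its query_string part. Setting explain to true in the search request makes Elasticsearch attach an _explanation to every hit:

```json
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "*lettering*"
    }
  }
}
```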

You can change this behavior with the rewrite parameter. Try the following; rewrite also works with the query_string query:

{
  "query": {
    "wildcard": {
      "company": {
        "value": "digital*",
        "rewrite": "scoring_boolean"
      }
    }
  }
}

The rewrite parameter has several options for scoring; see which one fits your requirements.
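
Applied to the query_string clause from the question, this might look like the sketch below. It assumes the surrounding filtered query and its filters stay unchanged, and scoring_boolean is only one of the possible rewrite values:

```json
{
  "query": {
    "query_string": {
      "query": "*lettering*",
      "rewrite": "scoring_boolean"
    }
  }
}
```

Note that scoring_boolean expands the wildcard into one boolean clause per matching term, so a pattern that matches very many terms can exceed the boolean clause limit. The top_terms_N variants (for example top_terms_50) keep scoring but cap the number of expanded terms.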

EDIT 1: The reason you see scores other than 1 for *lettering* *digital* is queryNorm; you can check this with the explain API as well. If you look closely, all documents that match both terms share one score, and all documents that match only one term share another.

P.S.: A leading wildcard is not recommended at all. You will run into performance issues, since the pattern has to be checked against every single term in the inverted index. You might want to look at the edge ngram or ngram token filters instead.
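
A rough sketch of what the ngram route could look like as index settings, not a drop-in configuration. The analyzer and filter names (substring_index, substring_search, substring_ngram), the mapping type name entry, and the min_gram/max_gram values are all assumptions chosen for illustration; only the keywords field comes from the question. The parameter names follow Elasticsearch 2.x, where analyzer sets the index-time analyzer (1.x used index_analyzer for that):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "substring_ngram": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 15
        }
      },
      "analyzer": {
        "substring_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "substring_ngram"]
        },
        "substring_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "entry": {
      "properties": {
        "keywords": {
          "type": "string",
          "analyzer": "substring_index",
          "search_analyzer": "substring_search"
        }
      }
    }
  }
}
```

With a setup along these lines, substrings are indexed as terms of their own, so a plain match query for lettering on keywords can find them without wildcards and is scored normally. Search terms shorter than min_gram or longer than max_gram would not match, so those two values need to fit the expected input.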

Hope this helps!

answered Sep 20 '22 08:09 by ChintanShah25