We are using Elasticsearch to search for the most relevant companies in a specific catalog. When we use a normal search term like lettering, we get reasonable scores and can sort the results according to the score.
However, when we modify the search term before querying and build the "starred" version of it (e.g., *lettering*) to be able to search for substrings, we get a score of 1.0 for every result. Substring search is a requirement in the project.
Any ideas what could cause this scoring behavior? The problem occurs only when a single term is used. We get comprehensible scores when we use two starred terms in combination (e.g., *lettering* *digital*).
EDIT 1:
Exemplary mapping (YAML; the other properties are mapped in the same way, except for boost, which differs per property):
elasticSearchMapping:
  type: object
  include_in_all: true
  enabled: true
  properties:
    'keywords':
      type: string
      include_in_all: true
      boost: 50
Query:
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [{
            "match_all": {}
          }, {
            "query_string": {
              "query": "*lettering*"
            }
          }]
        }
      },
      "filter": {
        "bool": {
          "must": [{
            "term": {
              "__parentPath": "/sites/industrycatalog"
            }
          }, {
            "terms": {
              "__workspace": ["live"]
            }
          }, {
            "term": {
              "__dimensionCombinationHash": "d751713988987e9331980363e24189ce"
            }
          }, {
            "term": {
              "__typeAndSupertypes": "IndustryCatalog:Entry"
            }
          }],
          "should": [],
          "must_not": [{
            "term": {
              "_hidden": true
            }
          }, {
            "range": {
              "_hiddenBeforeDateTime": {
                "gt": "now"
              }
            }
          }, {
            "range": {
              "_hiddenAfterDateTime": {
                "lt": "now"
              }
            }
          }]
        }
      }
    }
  },
  "fields": ["__path"],
  "script_fields": {
    "distance": {
      "script": "doc['coordinates'].distanceInKm(51.75631079999999,14.332867899999997)"
    }
  },
  "sort": [{
    "customer.featureFlags.industrycatalog": {
      "order": "asc"
    }
  }, {
    "_geo_distance": {
      "coordinates": {
        "lat": "51.75631079999999",
        "lon": "14.332867899999997"
      },
      "order": "asc",
      "unit": "km",
      "distance_type": "plane"
    }
  }],
  "size": 999999
}
Before scoring documents, Elasticsearch first reduces the set of candidate documents by applying a boolean test that only includes documents that match the query. A score is then calculated for each document in this set, and this score determines how the documents are ordered.
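On Elasticsearch 2.x and later this split is expressed with a bool query, where filter clauses restrict the candidate set but do not contribute to the score; here is a minimal sketch of that shape, reusing field names from the query above (not the full query from the question):

{
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "*lettering*" } }
      ],
      "filter": [
        { "term": { "__workspace": "live" } }
      ]
    }
  }
}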
Elasticsearch uses two kinds of similarity scoring functions: TF-IDF before version 5.0 and Okapi BM25 from 5.0 on. TF-IDF measures how common a word is locally and how rare it is globally to determine how relevant a term is to a document.
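In simplified form, Lucene's classic practical scoring function (a sketch based on the TFIDFSimilarity documentation; the coordination and norm factors are shown but not expanded) is:

\[
\text{score}(q,d) = \text{queryNorm}(q) \cdot \text{coord}(q,d) \cdot \sum_{t \in q} \text{tf}(t,d) \cdot \text{idf}(t)^2 \cdot \text{boost}(t) \cdot \text{norm}(t,d)
\]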
The reason might be that you haven't provided the size parameter in the query. This limits the result count to 10 by default, so the top 10 results might all come from the first two indices even though the match is present in the third index as well.
By default, you cannot use from and size to page through more than 10,000 hits. This limit is a safeguard set by the index.max_result_window index setting. If you need to page through more than 10,000 hits, use the search_after parameter instead.
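With search_after, each request passes the sort values of the last hit from the previous page. A minimal sketch (the tiebreaker field __identifier and the example sort values are hypothetical; any unique, sortable field will do):

{
  "size": 1000,
  "query": {
    "query_string": { "query": "*lettering*" }
  },
  "sort": [
    { "customer.featureFlags.industrycatalog": { "order": "asc" } },
    { "__identifier": { "order": "asc" } }
  ],
  "search_after": [0, "af6f3c4e-0000-0000-0000-000000000000"]
}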
What you are doing is a wildcard query. Wildcard queries fall under term-level queries, and by default a constant score is applied.
Check the Lucene documentation: WildcardQuery extends MultiTermQuery.
You can also verify this with the help of the explain API; you will see something like this:
"_explanation": {
"value": 1,
"description": "ConstantScore(company:lettering), product of:",
"details": [{
"value": 1,
"description": "boost"
}, {
"value": 1,
"description": "queryNorm"
}]
}
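For reference, one way to produce such output is the top-level explain flag on a search request; a minimal sketch using the query from the question:

{
  "explain": true,
  "query": {
    "query_string": { "query": "*lettering*" }
  }
}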
You can change this behavior with rewriting. Try this; rewrite also works with the query_string query:
{
  "query": {
    "wildcard": {
      "company": {
        "value": "digital*",
        "rewrite": "scoring_boolean"
      }
    }
  }
}
The rewrite parameter has various options for scoring; see which one fits your requirement.
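Applied to the query from the question, only the query_string clause needs to change; an untested sketch of just that clause:

{
  "query_string": {
    "query": "*lettering*",
    "rewrite": "scoring_boolean"
  }
}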
EDIT 1: The reason you see a score other than 1 for *lettering* *digital* is queryNorm; you can again check this with the explain API. If you look closely, all documents matching both terms will have the same score, and all documents matching only a single term will likewise share the same score.
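For reference, Lucene's classic similarity defines it as (simplified, per the TFIDFSimilarity documentation):

\[
\text{queryNorm}(q) = \frac{1}{\sqrt{\text{sumOfSquaredWeights}}}
\]

which depends only on the query, so it scales all scores by the same factor and does not change the ranking.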
P.S.: A leading wildcard is not recommended at all; you will get performance issues, since it has to be checked against every single term in the inverted index. You might want to look at the edge_ngram or ngram token filters instead.
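For example, a minimal sketch of an ngram-based analyzer for substring matching (the names substring_filter and substring_analyzer are made up for illustration, and min_gram/max_gram need tuning for your data):

{
  "settings": {
    "analysis": {
      "filter": {
        "substring_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "substring_filter"]
        }
      }
    }
  }
}

A field indexed with such an analyzer can be searched with a plain match query instead of wildcards, which is both faster and gives real relevance scores.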
Hope this helps!