
ElasticSearch: Is it possible to give a lower score for fuzziness?

I'm running a multi_match query (type most_fields, with "fuzziness": "AUTO") for "Rob", but "Ron" is ranked above "Rob" in the results.

If I remove the fuzziness, only Rob is returned, not Ron. However, I do want to keep the fuzziness; I just expect exact matches to be more relevant and to be shown first, and that isn't happening. The 'explain' output shows that the IDF of 'Ron' is a bit higher.

Back to my question: is it possible to apply some 'boost' or 'score' to the fuzzy part of the query?

David asked Feb 09 '16 19:02



2 Answers

OK, I ended up with the following, based on what was suggested here: https://medium.com/@oysterpail/fuzzy-queries-ae47b66b325c#.a4uxw5z0b

Their solution uses a bool query with should clauses. I can't use that directly, because I need this part of the query to be a must (I use the should part for relevancy), and multiple clauses under must are combined with AND. However, wrapping an or inside must did the trick:

{
   "query":{
      "bool":{
         "must":{
            "or":[
               {
                  "multi_match":{
                     "query":"rob",
                     "fields":[
                        "username",
                        "firstName",
                        "lastName"
                     ],
                     "type":"most_fields",
                     "fuzziness":"AUTO"
                  }
               },
               {
                  "multi_match":{
                     "query":"rob",
                     "fields":[
                        "username",
                        "firstName",
                        "lastName"
                     ],
                     "type":"most_fields"
                  }
               }
            ]
         }
      }
   }
}

This way, results that come only from the fuzziness match just the first clause, whereas exact-match results match both clauses, so they score higher and show up first.
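A note for readers on newer versions: the or query was deprecated in Elasticsearch 2.0 and removed in 5.0. A minimal sketch of what should be an equivalent query is to nest a bool with should clauses inside the must. The "minimum_should_match": 1 setting is an addition here, spelling out that at least one of the two clauses has to match; the field names are the ones from the query above. Treat this as an untested sketch rather than a drop-in replacement.

```json
{
   "query":{
      "bool":{
         "must":{
            "bool":{
               "should":[
                  {
                     "multi_match":{
                        "query":"rob",
                        "fields":["username", "firstName", "lastName"],
                        "type":"most_fields",
                        "fuzziness":"AUTO"
                     }
                  },
                  {
                     "multi_match":{
                        "query":"rob",
                        "fields":["username", "firstName", "lastName"],
                        "type":"most_fields"
                     }
                  }
               ],
               "minimum_should_match":1
            }
         }
      }
   }
}
```

As before, an exact match satisfies both should clauses and their scores are added together, while a fuzzy-only match satisfies just the first. If the gap is still too small, adding a "boost" to the second (exact) multi_match is one way to widen it.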

David answered Oct 04 '22 15:10


This is quite an old question, but I'll answer to help others who find it now. The reason you are getting 'Ron' before 'Rob' is the TF/IDF algorithm: in your dataset the term 'Rob' occurs in more documents than 'Ron', so its inverse document frequency is lower and the algorithm gives 'Rob' a lower score.
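To make the effect concrete, here is a rough worked example. It assumes the classic Lucene TF/IDF similarity that was the default when the question was asked, where N is the number of documents and n_t is the number of documents containing term t; the counts below are hypothetical, not taken from the asker's data.

```latex
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{N}{n_t + 1}\right)

% Hypothetical index: N = 1000 documents
% 'Rob' appears in 50 documents, 'Ron' in 5
\mathrm{idf}(\text{Rob}) = 1 + \ln\!\left(\frac{1000}{51}\right) \approx 3.98
\qquad
\mathrm{idf}(\text{Ron}) = 1 + \ln\!\left(\frac{1000}{6}\right) \approx 6.12
```

With numbers like these, the rarer term 'Ron' gets the larger IDF, which can be enough to outrank an exact match on the more common 'Rob'.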

If you only want to search names, you can use a different scoring algorithm (similarity). In your case, the 'boolean' similarity should work.
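As a sketch of how that might look, the similarity is set per field in the mapping when the index is created. The body below assumes a recent Elasticsearch version with typeless mappings and would be sent as the body of an index-creation request (for example PUT /users, where the index name is hypothetical); the field names are taken from the question's query.

```json
{
   "mappings":{
      "properties":{
         "username":{
            "type":"text",
            "similarity":"boolean"
         },
         "firstName":{
            "type":"text",
            "similarity":"boolean"
         },
         "lastName":{
            "type":"text",
            "similarity":"boolean"
         }
      }
   }
}
```

With the boolean similarity, a term's score is based only on whether it matches (it equals the query boost), so how rare 'Ron' or 'Rob' is in the index no longer affects the ranking.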

Sourav Patra answered Oct 04 '22 15:10