elasticsearch fuzzy matching max_expansions & min_similarity

Tags:

elasticsearch

fuzzy-comparison

fuzzy-search

fuzzy-logic

I'm using fuzzy matching in my project, mainly to find misspellings and alternative spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.

As I understand it, min_similarity is the percentage by which the queried string must match the string in the database. I couldn't find an exact description of how this value is calculated.

As I understand it, max_expansions is the Levenshtein distance within which the search should be executed. If it actually were the Levenshtein distance, it would be the ideal solution for me. However, it does not behave that way. For example, I have the word "Samvel":

queryStr      max_expansions          matches?
samvel        0                       Error: must not be 0 (but a Levenshtein distance can be 0!)
samvel        1                       Yes
samvvel       1                       Yes
samvvell      1                       Yes (but it shouldn't have)
samvelll      1                       Yes (but it shouldn't have)
saamvelll     1                       No (but for some weird reason it matches with Samvelian)
saamvelll     anything bigger than 1  No

The documentation says something I do not understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

Can anyone please explain exactly how these parameters affect the search results?


asked Aug 22 '11 13:08

Yervand Aghababyan

1 Answer

The min_similarity is a value between zero and one. From the Lucene docs:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The 'edit distance' that is referred to is the Levenshtein distance.
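To make the rule above concrete, here is a minimal Python sketch of the check. It is not Lucene's code. It assumes that similarity is computed as 1 - distance / length, which agrees with the quoted example at 0.5, and it uses the shorter of the two strings as the length when they differ. The function names levenshtein and is_similar are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_similar(query: str, term: str, min_similarity: float) -> bool:
    """Assumed rule: distance < length * (1 - min_similarity).

    Using the shorter string as the length is an assumption of this sketch.
    """
    shortest = min(len(query), len(term))
    if shortest == 0:
        return query == term
    return levenshtein(query, term) < shortest * (1 - min_similarity)


print(levenshtein("samvel", "samvvel"))       # 1
print(is_similar("samvel", "samvvel", 0.5))   # True: 1 < 6 * 0.5
print(is_similar("samvel", "saamvelll", 0.5)) # False: 3 is not < 6 * 0.5
```

Under this reading, a higher min_similarity allows fewer edits, so it is the parameter that plays the role of a Levenshtein threshold.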

Internally, the query works in two steps:

First, it finds all terms in the index that could match the search term, taking min_similarity into account.
Then it searches for all of those terms.

You can imagine how heavy this query could be!

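To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered. In other words, max_expansions is a cap on how many terms the query expands to, not an edit distance.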
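The two steps and the cap can be sketched roughly as follows. This is an illustration, not how Elasticsearch is implemented: the helper names expand_terms and build_bool_query, the field name "name" and the toy index are all hypothetical, the similarity test is passed in as a function, and which terms survive the cap in the real engine is not covered here (this sketch simply keeps the first ones it sees).

```python
from typing import Callable, Iterable, List, Optional


def expand_terms(query: str,
                 index_terms: Iterable[str],
                 is_match: Callable[[str, str], bool],
                 max_expansions: Optional[int] = None) -> List[str]:
    """Step 1: collect index terms that pass the similarity test, up to the cap."""
    matched = [term for term in index_terms if is_match(query, term)]
    if max_expansions is not None:
        # Keeping the first N in iteration order is an assumption of this sketch.
        matched = matched[:max_expansions]
    return matched


def build_bool_query(field: str, terms: List[str]) -> dict:
    """Step 2: search for all surviving terms, here as one 'should' clause per term."""
    return {"bool": {"should": [{"term": {field: t}} for t in terms]}}


# Toy stand-in for the real similarity test: share the first three characters.
same_prefix = lambda q, t: t.startswith(q[:3])

index_terms = ["samvel", "samvelian", "samuel", "david"]
print(expand_terms("samvel", index_terms, same_prefix))
print(expand_terms("samvel", index_terms, same_prefix, max_expansions=1))
print(build_bool_query("name", expand_terms("samvel", index_terms, same_prefix)))
```

With max_expansions set to 1, only a single candidate term is searched, no matter how many terms in the index are close to the query. That is why treating the parameter as a distance gives confusing results.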


answered Oct 03 '22 10:10

DrTech
