I'm using fuzzy matching in my project mainly to find misspellings and different spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.
As I understand it, min_similarity is the percentage by which the queried string must match the string in the database. I couldn't find an exact description of how this value is calculated.
As I understand it, max_expansions is the Levenshtein distance within which the search is executed. If it actually were the Levenshtein distance, it would be the ideal solution for me. However, it doesn't behave that way. For example, I have the word "Samvel":
queryStr     max_expansions            matches?
samvel       0                         Should not be 0. error (but levenshtein distance can be 0!)
samvel       1                         Yes
samvvel      1                         Yes
samvvell     1                         Yes (but it shouldn't have)
samvelll     1                         Yes (but it shouldn't have)
saamvelll    1                         No (but for some weird reason it matches with Samvelian)
saamvelll    anything bigger than 1    No
The documentation says something I actually do not understand:
Add max_expansions to the fuzzy query allowing to control the maximum number
of terms to match. Default to unbounded (or bounded by the max clause count in
boolean query).
So can anyone please explain to me exactly how these parameters affect the search results?
In Elasticsearch, a fuzzy query matches terms that are not exact matches of the terms in the index. You can use the fuzziness parameter of the match query to find the correct word for a typo. For a term of 6 characters, Elasticsearch allows an edit distance of 2 by default.
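The length-based default can be sketched roughly as follows. This assumes the AUTO fuzziness setting with its standard thresholds of 3 and 6 characters; the function name auto_fuzziness is made up for illustration and is not part of any Elasticsearch client.

```python
def auto_fuzziness(term: str) -> int:
    """Edit distance allowed for `term`, assuming AUTO fuzziness (thresholds 3 and 6)."""
    n = len(term)
    if n < 3:
        return 0  # 0-2 characters: must match exactly
    if n < 6:
        return 1  # 3-5 characters: one edit allowed
    return 2      # 6 or more characters: two edits allowed


for word in ("sa", "samv", "samvel"):
    print(word, auto_fuzziness(word))
```

So a 6-character term such as "samvel" would be allowed up to 2 edits under this rule.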
The max_expansions setting, which defines the maximum number of terms the fuzzy query will match before halting the search, can also have dramatic effects on the performance of a fuzzy query.
Fuzzy query. Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance. An edit distance is the number of one-character changes needed to turn one term into another.
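To make the edit distance concrete, here is a minimal sketch of the plain Levenshtein distance (inserts, deletes and substitutions only, with no special handling of transpositions). It is an illustration in Python, not the code Elasticsearch or Lucene actually runs, and it compares against the lowercased "samvel" on the assumption that the indexed term is lowercased.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of one-character inserts, deletes or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


for query in ("samvel", "samvvel", "samvvell", "samvelll", "saamvelll"):
    print(query, levenshtein(query, "samvel"))
# samvel 0
# samvvel 1
# samvvell 2
# samvelll 2
# saamvelll 3
```

Measured this way, only "samvel" and "samvvel" are within an edit distance of 1 of "samvel".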
The match query analyzes any provided text before performing a search. This means the match query can search text fields for analyzed tokens rather than an exact term.
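As a rough illustration, a match query with fuzziness enabled could be written as the request body below. The field name "name" and the value 50 for max_expansions are assumptions made for this example; the body is only built and printed here, not sent to a cluster.

```python
import json

# Hypothetical request body: "name" is an assumed field, 50 an arbitrary cap.
body = {
    "query": {
        "match": {
            "name": {
                "query": "samvvel",    # text is analyzed before matching
                "fuzziness": "AUTO",   # allowed edits depend on term length
                "max_expansions": 50,  # cap on the number of expanded terms
            }
        }
    }
}

print(json.dumps(body, indent=2))
```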
The min_similarity is a value between zero and one. From the Lucene docs:
For example, for a minimumSimilarity of 0.5 a term of the same length
as the query term is considered similar to the query term if the edit
distance between both terms is less than length(term)*0.5
The 'edit distance' that is referred to is the Levenshtein distance.
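The rule quoted above can be sketched as a small check. The quote only covers terms of the same length as the query; dividing by the shorter of the two lengths and comparing with a strict "greater than" are assumptions of this sketch, and the function names are invented for illustration.

```python
def similarity(edit_distance: int, query_len: int, term_len: int) -> float:
    """Similarity in [0, 1]; assumes normalisation by the shorter length."""
    shorter = min(query_len, term_len)
    if shorter == 0:
        return 0.0  # avoid dividing by zero for empty strings
    return 1.0 - edit_distance / shorter


def is_similar(edit_distance: int, query_len: int, term_len: int,
               min_similarity: float = 0.5) -> bool:
    """True if the term passes the min_similarity threshold."""
    return similarity(edit_distance, query_len, term_len) > min_similarity


# A 6-character query against 6-character terms with min_similarity 0.5:
print(is_similar(2, 6, 6))  # distance 2 < 6 * 0.5, so similar
print(is_similar(3, 6, 6))  # distance 3 is not < 6 * 0.5, so not similar
```

For a 6-character term and a min_similarity of 0.5, this matches the quoted rule: the term is similar only if the edit distance is less than 3.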
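The way this query works internally is: first, it finds all terms that exist in the index that could match the search term, taking min_similarity into account; then it searches for all of those terms. You can imagine how heavy this query could be!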
To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
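The expansion step and the effect of max_expansions can be simulated roughly as below. This is a toy model, not Lucene's implementation: the index terms are invented, the threshold is expressed as a maximum edit distance rather than a similarity, and keeping the closest terms first is an assumption about which candidates survive the cap.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def expand_terms(query: str, index_terms: Iterable[str],
                 max_edits: int, max_expansions: int) -> List[str]:
    """Collect index terms within max_edits of query, capped at max_expansions."""
    scored = [(edit_distance(query, term), term) for term in index_terms]
    close = sorted(pair for pair in scored if pair[0] <= max_edits)
    # Assumption: the closest terms are kept when the cap is reached.
    return [term for _, term in close[:max_expansions]]


# Hypothetical index vocabulary for illustration only.
terms = ["samvel", "samuel", "manvel", "samvelian", "sam"]

print(expand_terms("samvel", terms, max_edits=2, max_expansions=50))
# ['samvel', 'samuel', 'manvel']
print(expand_terms("samvel", terms, max_edits=2, max_expansions=2))
# ['samvel', 'samuel']
```

In this model max_expansions never changes how far a term may be from the query; it only limits how many of the candidate terms are carried into the actual search.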