ElasticSearch's Fuzzy Query

I am brand new to ElasticSearch and am currently exploring its features. One I am interested in is the Fuzzy Query, which I am testing and having trouble using. It is probably a dumb question, so I guess someone who has already used this feature will quickly find the answer, at least I hope. :)

BTW I have the feeling that it might not be only related to ElasticSearch but maybe directly to Lucene.

Let's start with a new index named "firstindex" in which I store an object of type "node" whose "label" has the value "american football". This is the request I use to index it.

bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{   "node" : {     "label" : "american football"   } } '

This is the result I get.

{   "ok" : true,   "_index" : "firstindex",   "_type" : "node",   "_id" : "6TXNrLSESYepXPpFWjpl1A",   "_version" : 1 }

So far so good, now I want to find this entry using a fuzzy query. This is the one I send:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{   "query" : {     "fuzzy" : {       "label" : {         "value" : "american football",         "boost" : 1.0,         "min_similarity" : 0.0,         "prefix_length" : 0       }                            }        }    } '

And this is the result I get

{   "took" : 15,   "timed_out" : false,   "_shards" : {     "total" : 5,     "successful" : 5,     "failed" : 0   },   "hits" : {     "total" : 0,     "max_score" : null,     "hits" : [ ]   } }

As you can see, no hit. But when I shorten my query's value slightly, from "american football" to "american footb", like this:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d ' {   "query" : {     "fuzzy" : {       "label" : {         "value" : "american footb",         "boost" : 1.0,         "min_similarity" : 0.0,         "prefix_length" : 0       }     }   } } '

Then I get a correct hit on my entry, thus the result is:

{   "took" : 0,   "timed_out" : false,   "_shards" : {     "total" : 5,     "successful" : 5,     "failed" : 0   },   "hits" : {     "total" : 1,     "max_score" : 0.19178301,     "hits" : [ {       "_index" : "firstindex",       "_type" : "node",       "_id" : "6TXNrLSESYepXPpFWjpl1A",       "_score" : 0.19178301, "_source" : {         "node" : {           "label" : "american football"         }       }     } ]   } }

So, I have several questions related to this test:

1. Why didn't I get any result when performing a query with a value exactly equal to my only entry, "american football"?
2. Is it related to the fact that I have a multi-word value?
3. Is there a way to get the "similarity" score in my query result, so I can better understand how to find the right threshold for my fuzzy queries?
4. There is a page dedicated to the Fuzzy Query on the ElasticSearch web site, but I am not sure it lists all the parameters I can use for the fuzzy query. Where could I find such an exhaustive list? Same question for the other queries, actually.
5. Is there a difference between a Fuzzy Query and a Query String Query using Lucene syntax to get fuzzy matching?

asked Apr 25 '12 04:04

A_dit_rien

1 Answer

1.

The fuzzy query operates on terms. It cannot handle phrases because it doesn't analyze the query text. The indexed value, however, was analyzed into the two terms "american" and "football", so in your example elasticsearch tries to match the single term "american football" against the term "american" and against the term "football". The match between terms is based on Levenshtein (edit) distance, which is used to calculate the similarity score. Since you have min_similarity=0.0, any term should match any other term as long as the edit distance is smaller than the length of the shorter term. In your case, the term "american football" has length 17 and the term "american" has length 8. The distance between these two terms is 9, which is bigger than 8, the length of the shorter term, so this term is rejected. The edit distance between "american footb" and "american" is 6: it's basically the term "american" with 6 characters added at the end. That's why it produces results. With min_similarity=0.0, pretty much anything with edit distance 7 or less will match. You will even get results while searching for "aqqqqqq", for example.
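To make the arithmetic above easier to check, here is a small Python sketch that computes the Levenshtein distance for the two queries from the question and applies the acceptance rule exactly as described in this answer (distance smaller than the length of the shorter term when min_similarity is 0.0). The function names are made up for illustration, and the rule is taken from the explanation above rather than verified against the Lucene source, so treat it as an approximation of what the fuzzy query does.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def accepted_at_zero_similarity(query_term: str, indexed_term: str) -> bool:
    # Assumption: rule as stated in the answer, not checked against Lucene itself.
    shorter = min(len(query_term), len(indexed_term))
    return levenshtein(query_term, indexed_term) < shorter


if __name__ == "__main__":
    indexed_terms = ["american", "football"]  # terms assumed to be produced at index time
    for query_term in ("american football", "american footb"):
        for term in indexed_terms:
            print(query_term, "vs", term,
                  "distance:", levenshtein(query_term, term),
                  "accepted:", accepted_at_zero_similarity(query_term, term))
```

Under this rule "american football" is rejected against both indexed terms, while "american footb" is accepted against "american", which is consistent with the results shown in the question.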

2.

Yes, as I explained above, it is somewhat related to multi-word values. If you want to search for multiple terms, take a look at the Fuzzy Like This Query and the fuzziness parameter of the Text Query.
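A different way to handle a multi-word value, not one of the two queries named above, is to split the input yourself and send one fuzzy clause per word inside a bool query. The Python sketch below only builds the request body. The helper name is hypothetical, lowercasing and splitting on whitespace is a rough stand-in for whatever analyzer the field really uses, and min_similarity is the parameter name taken from the question (the value 0.5 is an arbitrary choice for illustration).

```python
import json


def per_term_fuzzy_query(field: str, text: str, min_similarity: float = 0.5) -> dict:
    """Build a bool query with one fuzzy clause per whitespace-separated word."""
    # Assumption: lowercase + whitespace split approximates the index-time analysis.
    terms = text.lower().split()
    clauses = [
        {"fuzzy": {field: {"value": term, "min_similarity": min_similarity}}}
        for term in terms
    ]
    # "must" requires every word to match fuzzily; "should" would make each optional.
    return {"query": {"bool": {"must": clauses}}}


if __name__ == "__main__":
    body = per_term_fuzzy_query("label", "american football")
    print(json.dumps(body, indent=2))
```

The printed JSON could be passed as the -d payload of the same _search request used in the question.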

4 & 5.

Usually, the next best source of information after elasticsearch.org is the elasticsearch source code.

answered Sep 28 '22 08:09

imotov
