I am brand new to ElasticSearch and am currently exploring its features. One I am interested in is the Fuzzy Query, which I am testing and having trouble using. It is probably a beginner's question, so I hope someone who has already used this feature will quickly find the answer. :)
BTW, I have the feeling that it might not be related only to ElasticSearch but perhaps directly to Lucene.
Let's start with a new index named "firstindex" in which I store an object "label" with the value "american football". This is the request I use:
bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{ "node" : { "label" : "american football" } } '
This is the result I get:
{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}
So far so good. Now I want to find this entry using a fuzzy query. This is the one I send:
bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{ "query" : { "fuzzy" : { "label" : { "value" : "american football", "boost" : 1.0, "min_similarity" : 0.0, "prefix_length" : 0 } } } } '
And this is the result I get:
{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
As you can see, there are no hits. But when I shorten my query's value slightly, from "american football" to "american footb", like this:
bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d ' { "query" : { "fuzzy" : { "label" : { "value" : "american footb", "boost" : 1.0, "min_similarity" : 0.0, "prefix_length" : 0 } } } } '
Then I get a correct hit on my entry. The result is:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301,
      "_source" : { "node" : { "label" : "american football" } }
    } ]
  }
}
So, I have several questions related to this test:
Why didn't I get any result when performing a query with a value exactly equal to my only entry, "american football"?
Is it related to the fact that I have a multi-word value?
Is there a way to get the "similarity" score in my query result, so I can better understand how to find the right threshold for my fuzzy queries?
There is a page dedicated to the Fuzzy Query on the ElasticSearch web site, but I am not sure it lists all the potential parameters I can use for the fuzzy query. Where could I find such an exhaustive list?
Same question for the other queries, actually.
Is there a difference between a Fuzzy Query and a Query String Query using Lucene syntax to get fuzzy matching?
Fuzzy query. Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance. An edit distance is the number of one-character changes needed to turn one term into another.
Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same.
The max_expansions setting, which defines the maximum number of terms the fuzzy query will match before halting the search, can also have dramatic effects on the performance of a fuzzy query.
The match query analyzes any provided text before performing a search. This means the match query can search text fields for analyzed tokens rather than an exact term. Its analyzer parameter (optional, string) sets the analyzer used to convert the text in the query value into tokens. It defaults to the index-time analyzer mapped for the <field>.
The fuzzy query operates on terms. It cannot handle phrases because it doesn't analyze the query text. So, in your example, elasticsearch tries to match the single term "american football" against the indexed term "american" and against the indexed term "football". The match between terms is based on Levenshtein distance, which is used to calculate the similarity score. Since you have min_similarity=0.0, any term should match any other term as long as the edit distance is smaller than the length of the shorter term.
In your case, the term "american football" has length 17 and the term "american" has length 8. The distance between these two terms is 9, which is bigger than the length of the shorter term, 8, so the term is rejected. The edit distance between "american footb" and "american" is 6: it's basically the term "american" with 6 characters added at the end. That's why it produces results. With min_similarity=0.0, pretty much anything with an edit distance of 7 or less will match. You will even get results while searching for "aqqqqqq", for example.
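The distances quoted above can be checked with a short sketch. This is plain Python, not Lucene's actual implementation: `levenshtein` is a textbook dynamic-programming edit distance, and `passes_zero_similarity` is a hypothetical helper that only encodes the rule of thumb stated in this answer (distance smaller than the length of the shorter term). The indexed terms "american" and "football" are assumed to be what the default analyzer produced for the label.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def passes_zero_similarity(query_term: str, indexed_term: str) -> bool:
    """Hypothetical helper: the answer's rule of thumb for min_similarity=0.0.

    A term is accepted when the edit distance is smaller than the length
    of the shorter of the two terms.
    """
    shortest = min(len(query_term), len(indexed_term))
    return levenshtein(query_term, indexed_term) < shortest


# Assumed indexed terms for the label "american football".
indexed_terms = ["american", "football"]

for query in ["american football", "american footb"]:
    for term in indexed_terms:
        distance = levenshtein(query, term)
        accepted = passes_zero_similarity(query, term)
        print(f"{query!r} vs {term!r}: distance={distance}, accepted={accepted}")
```

Run as a script, this reports a distance of 9 for "american football" against "american" (rejected, since 9 is not smaller than 8) and a distance of 6 for "american footb" against "american" (accepted), matching the numbers in the explanation.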
Yes, as I explained above, it is somewhat related to multi-word values. If you want to search for multiple terms, take a look at the Fuzzy Like This Query and the fuzziness parameter of the Text Query.
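As a rough illustration of the multi-term approach, the sketch below builds a request body for an analyzed query with a fuzziness setting. It rests on assumptions not stated in this thread: that your Elasticsearch version exposes the Text Query under its later name, `match`, and that it accepts `"fuzziness": "AUTO"`. The helper name `fuzzy_match_body` and the misspelled sample text are made up for the example; check the query DSL documentation for your version before relying on it.

```python
import json


def fuzzy_match_body(field: str, text: str, fuzziness="AUTO") -> dict:
    """Hypothetical helper: build a match query body with fuzziness.

    Unlike the fuzzy query, a match query analyzes the text first, so each
    word ("american", "football") is compared to indexed terms separately.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,  # assumption: supported by your version
                }
            }
        }
    }


# "label" is the field name used in the question; the typos are deliberate.
body = fuzzy_match_body("label", "amerikan footbal")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the same `_search` endpoint with `curl -d`, in place of the fuzzy query body used in the question.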
Usually, the next best source of information after elasticsearch.org is elasticsearch source code.