ElasticSearch's Fuzzy Query

I am brand new to ElasticSearch and am currently exploring its features. One I am interested in is the Fuzzy Query, which I am testing and having trouble using. It is probably a basic question, so I hope someone who has already used this feature will quickly find the answer. :)

BTW I have the feeling that it might not be only related to ElasticSearch but maybe directly to Lucene.

Let's start with a new index named "firstindex", in which I store a "label" field with the value "american football". This is the request I use.

bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{
  "node" : {
    "label" : "american football"
  }
}'

This is the result I get.

{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}

So far so good. Now I want to find this entry using a fuzzy query. This is the one I send:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american football",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}'

And this is the result I get:

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, there is no hit. But when I shorten my query's value slightly, from "american football" to "american footb", like this:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american footb",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}'

Then I get a correct hit on my entry. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301,
      "_source" : {
        "node" : {
          "label" : "american football"
        }
      }
    } ]
  }
}

So, I have several questions related to this test:

  1. Why didn't I get any result when querying with a value exactly equal to my only entry, "american football"?

  2. Is it related to the fact that I have a multi-words value?

  3. Is there a way to get the "similarity" score in my query result, so I can better understand how to find the right threshold for my fuzzy queries?

  4. There is a page dedicated to the Fuzzy Query on the ElasticSearch web site, but I am not sure it lists all the parameters I can use for the fuzzy query. Where could I find an exhaustive list?

  5. Same question for the other queries actually.

  6. Is there a difference between a Fuzzy Query and a Query String Query that uses Lucene syntax to get fuzzy matching?

asked Apr 25 '12 04:04 by A_dit_rien



1 Answer

1.

The fuzzy query operates on terms. It cannot handle phrases because it doesn't analyze the query text. The indexed value, on the other hand, was analyzed by the default analyzer, so "american football" was stored as the two terms american and football. So, in your example, elasticsearch tries to match the single term "american football" against the term american and against the term football.

The match between terms is based on Levenshtein distance, which is used to calculate the similarity score. Since you have min_similarity=0.0, any term should match any term as long as the edit distance is smaller than the length of the shorter term. In your case, the term "american football" has length 17 and the term "american" has length 8. The distance between these two terms is 9, which is bigger than the length of the shorter term, 8. As a result, this term is rejected.

The edit distance between "american footb" and "american" is 6: it is basically the term "american" with 6 additions at the end. That's why it produces results. With min_similarity=0.0, pretty much anything with edit distance 7 or less will match. You will even get results while searching for "aqqqqqq", for example.
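The arithmetic above can be checked with a small sketch. This is not how elasticsearch or Lucene implements fuzzy matching internally; it is a plain textbook Levenshtein function, and `passes_length_rule` is a hypothetical helper that only encodes the rule as stated in this answer for min_similarity=0.0 (edit distance smaller than the length of the shorter term).

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def passes_length_rule(query_term: str, indexed_term: str) -> bool:
    """Hypothetical helper: the acceptance rule as described in this answer
    for min_similarity=0.0, not code taken from elasticsearch itself."""
    shorter = min(len(query_term), len(indexed_term))
    return levenshtein(query_term, indexed_term) < shorter


if __name__ == "__main__":
    for query in ("american football", "american footb"):
        distance = levenshtein(query, "american")
        accepted = passes_length_rule(query, "american")
        print(f"{query!r} vs 'american': distance={distance}, accepted={accepted}")
```

Running it reports a distance of 9 (rejected) for "american football" and 6 (accepted) for "american footb", which matches the two search results in the question.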

2.

Yes, as I explained above, it is related to multi-word values. If you want to search for multiple terms, take a look at the Fuzzy Like This Query and the fuzziness parameter of the Text Query.
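As a rough sketch of the second suggestion, the snippet below builds a request body for a Text Query with a fuzziness value. The body layout, the helper name `fuzzy_text_query`, and the value 0.5 are assumptions for illustration; check them against the documentation for your elasticsearch version (in later releases the text query is called match). Unlike the fuzzy query, this query analyzes its input, so each word is compared separately.

```python
import json


def fuzzy_text_query(field: str, text: str, fuzziness: float) -> dict:
    """Hypothetical helper: build a search body for a text query with fuzziness.

    The layout is an assumption based on the Text Query named in the answer;
    it is not verified against any particular elasticsearch release.
    """
    return {
        "query": {
            "text": {
                field: {
                    "query": text,           # analyzed into separate terms
                    "fuzziness": fuzziness,  # arbitrary example value
                }
            }
        }
    }


if __name__ == "__main__":
    body = fuzzy_text_query("label", "american football", 0.5)
    # Print the JSON that would be passed to curl with -d
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the same _search endpoint used in the question in place of the fuzzy query body.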

4 & 5.

Usually, the next best source of information after elasticsearch.org is the elasticsearch source code.

answered Sep 28 '22 08:09 by imotov