
Fuzzy Search in Solr

I am working on a fuzzy query in Solr that runs over a repository of data which may contain misspelled or abbreviated words. For example, the repository could have a name containing the word "Hlth" (an abbreviated form of 'Health').

  1. If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
  2. If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.

I would like to get the first query to work. In my business use case, I have to use the clean data to query for all the misspelled or abbreviated words.

Could someone please shed some light on why fuzzy search #1 is not working, and whether there are other ways of achieving the same result?

Ravi asked May 20 '13 18:05





1 Answer

You are using the fuzzy query the wrong way.

According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
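To make the edit-distance idea concrete, here is a minimal sketch in Python. It is not Lucene's implementation: it uses plain Levenshtein distance (insertions, deletions, substitutions only), and the helper names `edit_distance` and `fuzzy_matches`, as well as the sample term list, are made up for illustration. It also assumes the indexed terms are lowercased, as they would be with a typical lowercasing analyzer.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between a and b (illustrative helper)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(query: str, terms: list[str], max_edits: int) -> list[str]:
    """Return the terms within max_edits of query (hypothetical helper)."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]


# Hypothetical lowercased terms standing in for an index.
terms = ["health", "hlth", "wealth", "help"]

print(fuzzy_matches("health", terms, 1))  # "hlth" is not within 1 edit
print(fuzzy_matches("health", terms, 2))  # "hlth" is within 2 edits
```

With this sketch, "hlth" only shows up once the allowed distance reaches 2, because getting from "health" to "hlth" takes two deletions ('e' and 'a').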

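So you need to write a query like Health~2. 'Hlth' is two deletions ('e' and 'a') away from 'Health', so a maximum edit distance of 2 is needed to match it.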

Mysterion answered Oct 21 '22 03:10
