I implemented a fuzzy search with Lucene 4.3.1, but I'm not satisfied with the results. I would like to specify how many results it returns: if I ask for 10, it should return the 10 best matches, no matter how bad they are. At the moment it usually returns nothing when the word I search for is very different from everything in the index. How can I get more (fuzzier) results?
Here is the code I have:
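// Imports assume Lucene 4.3.x.
import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public String[] luceneQuery(String query, int numberOfHits, String path)
        throws IOException {
    Directory index = FSDirectory.open(new File(path));
    IndexReader reader = DirectoryReader.open(index);
    try {
        IndexSearcher searcher = new IndexSearcher(reader);
        // FuzzyQuery takes the bare term. The "~" suffix is QueryParser syntax
        // and would become part of the term if it were appended here.
        Query fuzzyQuery = new FuzzyQuery(new Term("label", query), 2);
        // Ask for at most numberOfHits documents, best score first.
        ScoreDoc[] fuzzyHits = searcher.search(fuzzyQuery, numberOfHits).scoreDocs;
        String[] fuzzyResults = new String[fuzzyHits.length];
        for (int i = 0; i < fuzzyHits.length; ++i) {
            // Load the stored document and read back its "label" field.
            Document d = searcher.doc(fuzzyHits[i].doc);
            fuzzyResults[i] = d.get("label");
        }
        return fuzzyResults;
    } finally {
        // Release the reader even if the search throws.
        reader.close();
    }
}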
A fuzzy search searches for text that matches a term closely instead of exactly. Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term.
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the "?" symbol. To perform a multiple character wildcard search use the "*" symbol. You can also use the wildcard searches in the middle of a term.
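To make the wildcard semantics concrete, here is a minimal plain-Java sketch that translates a wildcard pattern into a regular expression and tests it against a single term. It is not Lucene's implementation (Lucene uses WildcardQuery against the index); the class name WildcardSketch and the method matches are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Illustration only: mimics "?" and "*" wildcard semantics on one term.
final class WildcardSketch {

    // Returns true if the whole term matches the wildcard pattern.
    static boolean matches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '?') {
                regex.append('.');          // exactly one character
            } else if (c == '*') {
                regex.append(".*");         // zero or more characters
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL)
                .matcher(term)
                .matches();
    }
}
```

With this sketch, the pattern te?t matches both test and text, and test* matches tester. A wildcard in the middle of a term works the same way, so te*t matches teapot.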
In Elasticsearch, a fuzzy query means that the terms in the query don't have to match the terms in the inverted index exactly. To measure how far a query term is from an indexed term, Elasticsearch uses the Levenshtein distance.
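You can't search for special characters in Lucene directly, because the query syntax reserves them. To match one literally, escape it with a backslash. Lucene's classic query parser reserves + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /. Longer lists that also include = > < @ describe query languages built on top of Lucene rather than Lucene's own parser.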
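Reserved characters can still be matched literally when each one is prefixed with a backslash. Lucene ships QueryParser.escape(String) for this purpose; the sketch below is only a plain-Java stand-in that shows the idea without needing Lucene on the classpath. The class name QueryEscapeSketch is hypothetical, and the character set is an assumption based on the classic query parser's reserved characters.

```java
// Illustration only: prefixes query-syntax characters with a backslash.
final class QueryEscapeSketch {

    // Assumed reserved set: \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    static String escape(String input) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');   // neutralise the syntax character
            }
            escaped.append(c);
        }
        return escaped.toString();
    }
}
```

For user input such as C++, the escaped form is C\+\+, which the parser then treats as ordinary text. In real code, prefer the library's own escape method over a hand-rolled one.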
Large edit distances are no longer supported by FuzzyQuery in Lucene 4.x. The current implementation of FuzzyQuery performs far better than the Lucene 3.x implementation, but it supports at most two edits. Distances greater than 2 Damerau–Levenshtein edits are considered rarely useful.
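To see what "two edits" means in practice, here is a minimal plain-Java sketch of the optimal string alignment variant of Damerau–Levenshtein distance, in which an insertion, deletion, substitution, or swap of two adjacent characters each counts as one edit. It is not Lucene's code (FuzzyQuery uses automata rather than a distance table), and it compares Java chars rather than Unicode code points. The names EditDistanceSketch, distance, and withinTwoEdits are hypothetical.

```java
// Illustration only: edit distance with adjacent transpositions.
final class EditDistanceSketch {

    static int distance(String a, String b) {
        // d[i][j] = edits needed to turn the first i chars of a
        // into the first j chars of b.
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1,       // deletion
                           Math.min(d[i][j - 1] + 1,       // insertion
                                    d[i - 1][j - 1] + cost)); // substitution
                // Swap of two adjacent characters counts as one edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    // True when b would be within a maximum of two edits of a.
    static boolean withinTwoEdits(String a, String b) {
        return distance(a, b) <= 2;
    }
}
```

Under this measure, lucnee is one edit away from lucene (a single swap), while kitten and sitting are three edits apart, so a query limited to two edits would not connect them. That limit is why a search term very different from everything in the index returns no hits.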
According to the FuzzyQuery documentation, if you really must have higher edit distances:
If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead.
The strong implication is that you should rethink what you're trying to accomplish and find a more useful approach.
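As a rough illustration of the n-gram idea, the sketch below ranks candidate labels by how many character bigrams they share with the query and always returns the best N, however weak the match. It is an in-memory stand-in, not Lucene's SpellChecker and not an index-backed solution. The class name NGramRankerSketch, its methods, the Dice-coefficient scoring, and the choice of bigrams are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: rank labels by shared character n-grams.
final class NGramRankerSketch {

    // Splits text into character n-grams, padded so that the first
    // and last characters also contribute boundary n-grams.
    static Set<String> ngrams(String text, int n) {
        StringBuilder padded = new StringBuilder();
        for (int i = 1; i < n; i++) padded.append('_');
        padded.append(text.toLowerCase(Locale.ROOT));
        for (int i = 1; i < n; i++) padded.append('_');

        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient over n-gram sets: 0.0 (nothing shared) to 1.0.
    static double similarity(String a, String b, int n) {
        Set<String> gramsA = ngrams(a, n);
        Set<String> gramsB = ngrams(b, n);
        int total = gramsA.size() + gramsB.size();
        if (total == 0) return 0.0;
        Set<String> shared = new HashSet<>(gramsA);
        shared.retainAll(gramsB);
        return 2.0 * shared.size() / total;
    }

    // Returns up to numberOfHits labels, best match first. Labels are
    // never filtered out by score, so weak matches are still returned.
    static List<String> topMatches(String query, List<String> labels,
                                   int numberOfHits) {
        List<String> ranked = new ArrayList<>(labels);
        ranked.sort(Comparator
                .comparingDouble((String label) -> similarity(query, label, 2))
                .reversed());
        int count = Math.min(numberOfHits, ranked.size());
        return new ArrayList<>(ranked.subList(0, count));
    }
}
```

Because nothing is discarded by a distance threshold, asking for 10 results yields 10 labels whenever at least 10 exist. Scoring every label in memory only suits small collections; for a real index, the n-grams would be indexed as terms so that the search engine does the ranking.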