Fuzzy search with Lucene

I implemented a fuzzy search with Lucene 4.3.1, but I'm not satisfied with the results. I would like to specify the number of results it should return: for example, if I ask for 10 results, it should return the 10 best matches, no matter how bad they are. Most of the time it returns nothing if the word I search for is very different from anything in the index. How can I get more (fuzzier) results?

Here is the code I have:

    public String[] luceneQuery(String query, int numberOfHits, String path)
            throws ParseException, IOException {

        File dir = new File(path);
        Directory index = FSDirectory.open(dir);

        // Parser-based fuzzy query using the "~" syntax. It is built but never
        // used below; `analyzer` is assumed to be a field of the enclosing class.
        Query q = new QueryParser(Version.LUCENE_43, "label", analyzer)
                .parse(query + "~");

        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Use the raw term here: appending "~" to it would make the tilde part
        // of the term and use up one of the two allowed edits.
        Query fuzzyQuery = new FuzzyQuery(new Term("label", query), 2);

        // Collect the stored "label" value of each hit, best match first.
        ScoreDoc[] fuzzyHits = searcher.search(fuzzyQuery, numberOfHits).scoreDocs;
        String[] fuzzyResults = new String[fuzzyHits.length];

        for (int i = 0; i < fuzzyHits.length; ++i) {
            int docId = fuzzyHits[i].doc;
            Document d = searcher.doc(docId);
            fuzzyResults[i] = d.get("label");
        }

        reader.close();
        return fuzzyResults;
    }
asked Jul 19 '13 12:07 by tadumtada


1 Answer

Large edit distances are no longer supported by FuzzyQuery in Lucene 4.x. The current implementation is a huge performance improvement over the Lucene 3.x implementation, but it supports at most two edits. Distances greater than 2 Damerau–Levenshtein edits are considered to be rarely useful.
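
To make "two edits" concrete, here is a rough plain-Java sketch of the optimal string alignment variant of Damerau–Levenshtein distance, where an insertion, deletion, substitution, or swap of two adjacent characters each counts as one edit. The class and method names are made up for illustration, and this is not how Lucene matches terms internally (FuzzyQuery works with automata rather than comparing the query against every term); it only shows what the metric measures.

```java
final class EditDistance {

    /**
     * Optimal string alignment distance between a and b: the number of
     * insertions, deletions, substitutions and adjacent transpositions,
     * each costing 1, needed to turn a into b.
     */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,   // deletion
                                 d[i][j - 1] + 1),  // insertion
                        d[i - 1][j - 1] + cost);    // substitution or match
                // Adjacent transposition, e.g. "en" <-> "ne".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** True if b would be reachable from a within the two-edit limit. */
    static boolean withinTwoEdits(String a, String b) {
        return distance(a, b) <= 2;
    }
}
```

With this metric, "lucnee" is one edit away from "lucene" (a single transposition), while "kitten" and "sitting" are three edits apart and would therefore not match at the maximum of two.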

The FuzzyQuery documentation has this to say if you really must have higher edit distances:

"If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead."

The strong implication is that you should rethink what you're trying to accomplish and find a more useful approach.
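
As a rough illustration of the n-gram idea, the sketch below ranks candidate labels by how many character trigrams they share with the query and always returns the best k, however weak the overlap. It is plain Java, not Lucene's SpellChecker; the class name, the "_" padding, and the Jaccard scoring are assumptions chosen for the example, and the labels are assumed to be already loaded into memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NGramRanker {

    /** Character n-grams of the lower-cased word, padded with '_' on both ends. */
    static Set<String> grams(String word, int n) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < n - 1; i++) {
            pad.append('_');
        }
        String padded = pad + word.toLowerCase(Locale.ROOT) + pad;
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    /** Jaccard similarity of the two n-gram sets, between 0.0 and 1.0. */
    static double similarity(String a, String b, int n) {
        Set<String> ga = grams(a, n);
        Set<String> gb = grams(b, n);
        Set<String> intersection = new HashSet<>(ga);
        intersection.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /**
     * Returns up to k labels ordered by descending trigram similarity to the
     * query. Labels with zero overlap are still returned if fewer than k
     * better candidates exist; ties keep their input order.
     */
    static List<String> topMatches(String query, List<String> labels, int k) {
        List<String> sorted = new ArrayList<>(labels);
        sorted.sort(Comparator
                .comparingDouble((String s) -> similarity(query, s, 3))
                .reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

For example, calling topMatches("appel", labels, 2) on the labels apple, apply, banana, grape returns apple and apply, and a query with no overlap at all still yields k results. Scanning every label like this only scales to small vocabularies; an n-gram index such as the one the documentation suggests avoids the full scan.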

answered Nov 04 '22 00:11 by femtoRgon