Fuzzy string search library in Java [closed]

People also ask

What is fuzzy search in Java?

Advertisements. FuzzyQuery is used to search documents using fuzzy implementation that is an approximate search based on the edit distance algorithm.

How is fuzzy logic implemented in Java?

Fuzzy logic is an abstract concept that is completely independant of programming lanuages. The basic idea is that instead of boolean logic where any statement is either "true" or "false", you use a continuum where a statement can be anywhere between "100% true" and "0% true".

What is fuzzy search example?

For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.

What is fuzzy search how fuzzy search works?

A fuzzy search searches for text that matches a term closely instead of exactly. Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term.

Commons Lang has an implementation of Levenshtein distance.

Commons Codec has an implementation of soundex and metaphone.

If you are mostly comparing short strings and want something portable and lightweight you can use the well known python algorithm fuzzywuzzy ported to Java.

You can read more about it here

You can use Apache Lucene, but depending on the use case this may be too heavy weight. For very simple fuzzy searches it may be a bit complex to use and (correct me if I'm wrong) it requires you to build an index.

If you need a simple online (= not maintaining an index) algorithm you can use the fuzzy Bitap algorithm. I found an implementation in Java here. It's code fits in a single relatively short method with an almost self-explaining signature:

public static List<Integer> find(String doc, String pattern, int k)

Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version of String.equals, Bitap is like the fuzzy version of String.indexOf and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.

Notes:

The Bitap algorithm seems to be mostly useful for relatively small alphabets, e.g. plain ASCII. In fact the Simon Watiau version I linked to throws an ArrayIndexOutOfBoundsException on non-ASCII characters (>= 128) so you will have to filter these out.
I tried using Bimap in an application to search an in-memory list of persons by name. I found that a Levenhstein distance of 2 gives way too many false positives. A Levenhstein distance of 1 works better, but it cannot detect a typo where you swap two letters, e.g. "William" and "Willaim". I can think of a few ways to solve this, e.g.
1. do a fuzzy search only when an exact search finds no matches (and show a message to the user about this)
2. adjust Bitap to use Damerau-Levenshtein distance where a swap has distance 1 instead of 2. According to wikipedia, this is possible, but I could not find an existing implementation in Java.
3. instead of "contains" do a "startsWith". The fuzzy search tools contains a prefix version of Damerau-Levenshtein, but it gave me an ArrayIndexOutOfBoundsException
4. adjust the algorithm to introduce search result ranking where exact matches score higher
If you are going to do 2 or 4, it may be better to use a proper full-text search library like Lucene anyway.
More information on fuzzy search can be found on this blog. It's author also created an implementation in Java called BitapOnlineSearcher, but requires you to use java.io.Reader together with an Alphabet class. It's Javadoc is written in Russian.

SimMetrics is probably what you need: http://sourceforge.net/projects/simmetrics/

It has several algorithms for calculating various flavours of edit-distance.

Lucene is a very powerful full-text search engine, but FT search isn't exactly the same thing as fuzzy string matching (eg. given a list of strings find me the one that is most similar to some candidate string).

To Lucene I would add SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Related questions
                            
                                HttpClient 4.0.1 - how to release connection? [duplicate]
                            
                                Popup Message boxes
                            
                                Should I mark all methods virtual?
                            
                                IntelliJ IDE | .iml File lost or deleted
                            
                                How to install JDK 10 under Ubuntu?
                            
                                EditText.setFocusable(false); can't be set to true. :/
                            
                                javax.mail.AuthenticationFailedException is thrown while sending email in java
                            
                                What is the best way to extract the first word from a string in Java?
                            
                                Getting last day of the month in a given string date
                            
                                How do I set the colour of a label (coloured text) in Java?
                            
                                What is the use of static synchronized method in java?
                            
                                How to start up spring-boot application via command line?
                            
                                IntelliJ: Error: java: release version 10 not supported
                            
                                Is there functionality to generate a random character in Java?
                            
                                Open source face recognition for Android [closed]
                            
                                Why is "using getString() to get device identifiers" not recommended?
                            
                                Java REST implementation: Jersey vs CXF [closed]
                            
                                How to set the DPI of Java Swing apps on Windows/Linux?
                            
                                Missing iFrame view for Javadocs JDK 11+
                            
                                Concurrent JUnit testing

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fuzzy string search library in Java [closed]

Tags:

java

nlp

fuzzy-search

People also ask

Recent Activity

Donate For Us