
Fuzzy Text Matching C#

I'm writing a desktop UI (.NET WinForms) to help a photographer clean up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component that uses some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries that are actually the same word or phrase and differ only by whitespace, punctuation, or a slight misspelling. The application will ultimately rely on the user to confirm each consolidation, but an effective way to find potential candidates automatically would be invaluable.

Myles McDonnell asked Nov 21 '11 21:11


People also ask

What is fuzzy data matching?

Fuzzy matching (FM), also known as approximate string matching, fuzzy name matching, or fuzzy string matching, is a technique that identifies elements in a data set that are similar but not identical. Despite the similar name, it is not the same thing as fuzzy logic.

Is fuzzy matching good?

Fuzzy matching lets you identify non-exact matches of your target item. Many search engines rely on it, and it is one of the main reasons you can get relevant results even if your query contains a typo or uses a different verb tense.

How do I test fuzzy search?

Fuzzy searches help you find relevant results even when the search terms are misspelled. In search tools that support the tilde operator (Lucene-style query syntax, for example), you perform a fuzzy search by appending a tilde (~) to the search term. For example, the search term bank~ can return rows that contain tank, benk, or banks.


2 Answers

Let me introduce you to the Levenshtein distance. It is awesome:

http://en.wikipedia.org/wiki/Levenshtein_distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.

Personally, I used this in a healthcare setting, where provider names were checked for duplicates. Using Levenshtein distance, we gave each candidate pair a confidence rating and let the users decide whether it was a true duplicate or something unique.
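To make the idea concrete, here is a minimal sketch of how such a confidence rating could be computed in C#. It is not the code from the healthcare project mentioned above: the class name `PhraseSimilarity`, the normalization rule (lower-case, keep only letters and digits), and the "1 minus distance divided by longer length" scaling are all assumptions chosen for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of any library.
public static class PhraseSimilarity
{
    // Assumed normalization: lower-case and keep only letters and digits, so
    // phrases that differ only by case, whitespace or punctuation compare equal.
    public static string Normalize(string phrase) =>
        new string(phrase.ToLowerInvariant()
                         .Where(c => char.IsLetterOrDigit(c))
                         .ToArray());

    // Two-row dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions and substitutions turning a into b.
    public static int Distance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row for the next pass.
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }

    // Confidence in [0, 1]; 1.0 means identical after normalization.
    public static double Confidence(string a, string b)
    {
        string x = Normalize(a), y = Normalize(b);
        int longest = Math.Max(x.Length, y.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)Distance(x, y) / longest;
    }
}
```

With 66k phrases, scoring every pair means roughly 2.2 billion comparisons, so it is probably worth grouping phrases by their normalized form first and only running the distance check on what is left, for example only between phrases of similar length. The threshold at which a pair is shown to the user is something to tune against the real data.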

Fosco answered Oct 24 '22 07:10


I know this is an old question, but I feel this answer can help people who are dealing with the same issue today.

Please have a look at https://github.com/JakeBayer/FuzzySharp

It is a C# NuGet package that offers several fuzzy matching methods, each implementing a different way of scoring how similar two strings are. I'm not sure, but the Levenshtein distance from Fosco's answer may be used in one of them.

Edit: I just noticed a comment about this package, but I think it deserves a better place inside this question

Daniël Tulp answered Oct 24 '22 05:10