Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to String.Contains() the Fuzzy way in C#?

I have a list of persons that I want to search for while filtering. Each time the user enters a search string, the filtering is applied.

There are two challenges to consider:

  1. The user may enter part of names
  2. The user may mistyping

The first one is simply resolved by searching for substrings e.g. String.Contains(). The second one could be resolved by using a Fuzzy Implementation (e.g. https://fuzzystring.codeplex.com)

But I don't know how to master both challenges simultaneously.

For example: I want to find the person "Dr. Martin Fowler" when entering one of:

  • "Martin"
  • "Fawler"
  • "Marten Fauler"

I guess I need to write a "FuzzyContains()" logic, that handle my needs and also has an acceptable performance. Any advices how to start?

like image 362
mamuesstack Avatar asked Dec 26 '22 04:12

mamuesstack


2 Answers

I modified Oliver answer who suggested the Levenshtein Distance algorithms, that is not the best choice here, since the calculated distance is to big when only parts of the names were entered. So, I ended up using the Longest Common Subsequence algorithm implemented by the awesome FuzzyString Lib.

const int TOLERANCE = 1;
string userInput = textInput.Text.ToLower();
var matchingPeople = people.Where(p =>
{
     //Check Contains
     bool contains = p.Name.ToLower().Contains(userInput);
     if(contains) return true;

     //Check LongestCommonSubsequence
     bool subsequenceTolerated = p.Name.LongestCommonSubsequence(userInput).Length >= userInput.Length - TOLERANCE;

     return subsequenceTolerated;
}).ToList();
like image 133
mamuesstack Avatar answered Dec 27 '22 19:12

mamuesstack


Seems to be a job for the Levenshtein distance algorithm (one of the dozens C# implementations).

You give this algorithm two strings (the one the user entered and one out of your list). Then it calculates how much characters must be replaced, added or removed to come from the first string to the second one. Then you can take all elements from your list where the distance is smaller or equal three (for example) to find simple typos.

If you have this method you could maybe use it that way:

var userInput = textInput.Text.ToLower();
var matchingEmployees = EmployeeList.Where(x =>
    x.Name.ToLower().Contains(userInput) ||
    LevenshteinDistance.Compute(x.Name.ToLower(), userInput) <= 3).ToList();
like image 30
Oliver Avatar answered Dec 27 '22 17:12

Oliver