Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Forgiving/Fuzzy search with LINQ

Tags:

c#

linq

I'm attempting to implement a search against a DB that I inherited. The requirement states that the user must be able to search for an object by name. Unfortunately, an object could have multiple names associated with it. For example:

ID    Name
1     John and Jane Doe
2     Foo McFoo
3     Boo McBoo

It's easy enough to implement a search when a single name exists in each record:

var objects =  from x in db.Foo
               where x.Name.Contains("Foo McFoo")
               select x;

However, when multiple names exist, that approach does not work.

Question: Is it possible to write a search method that would return record one (John and Jane Doe) when someone uses the search terms John Doe or Jane Doe?

like image 422
James Hill Avatar asked Oct 31 '12 16:10

James Hill


4 Answers

You either need to pull the names out in to First/LastName columns or in to another table perhaps if there's multiple aliases.

But what I really think you should look at it something like Lucene if you need something 'forgiving' or 'fuzzy'

Question: Is it possible to write a search method that would return record one (John and Jane Doe) when someone uses the search terms John Doe or Jane Doe?

To be very specific to your question, you could convert "John Doe" to LIKE '%John%Doe' or "Jane Doe" to LIKE '%Jane%Doe' and this would retrieve that record. However I could see problems with names like "Johnathan Poppadoe".

like image 163
hometoast Avatar answered Oct 07 '22 03:10

hometoast


This would hurt performance, but how about this quick one:

string[] filters = "John Doe".Split(new[] {' '});
var objects =  from x in db.Foo
               where filters.All(f => x.Name.Contains(f))
               select x;

It seems to return what you would expect. Now you would tune it to behave nice when you have also a record "John Doe" as well as "John and Jane Doe".

Does this work for you?

like image 37
Vladimir Avatar answered Oct 07 '22 02:10

Vladimir


You could create a custom extension method named "ContainsFuzzy":

public static bool ContainsFuzzy(this string target, string text){
    // do the cheap stuff first
    if ( target == text ) return true;
    if ( target.Contains( text ) ) return true;
    // if the above don't return true, then do the more expensive stuff
    // such as splitting up the string or using a regex
}

Then your LINQ would at least be easier to read:

var objects =  from x in db.Foo
               where x.Name.ContainsFuzzy("Foo McFoo")
               select x;

The obvious disadvantage is that each call to ContainsFuzzy means recreating your split list, etc, so there's some overhead involved. You could create a class called FuzzySearch which at least would give you some increased effeciency:

class FuzzySearch{

    private string _searchTerm;
    private string[] _searchTerms;
    private Regex _searchPattern;

    public FuzzySearch( string searchTerm ){
        _searchTerm = searchTerm;
        _searchTerms = searchTerm.Split( new Char[] { ' ' } );
        _searchPattern = new Regex(
            "(?i)(?=.*" + String.Join(")(?=.*", _searchTerms) + ")");
    }

    public bool IsMatch( string value ){
        // do the cheap stuff first
        if ( _searchTerm == value ) return true;
        if ( value.Contains( _searchTerm ) ) return true;
        // if the above don't return true, then do the more expensive stuff
        if ( _searchPattern.IsMatch( value ) ) return true;
        // etc.
    }

}

Your LINQ:

FuzzySearch _fuzz = new FuzzySearch( "Foo McFoo" );

var objects =  from x in db.Foo
               where _fuzz.IsMatch( x.Name )
               select x;
like image 28
JDB Avatar answered Oct 07 '22 02:10

JDB


I wonder how no one mentioned the Levenshtein distance algorithm.

It's an algorithm the tells the distance between two strings by an int.

Here's a SO post that you can find some implementations of this algorithm.

So with a distance function of signature int Distance(string x, string y), you can filter out high distances and order your results so that low distances appear on top of your results, using LINQ.
Note that this is going to be performance costly.

like image 28
Shimmy Weitzhandler Avatar answered Oct 07 '22 04:10

Shimmy Weitzhandler