I'm attempting to implement a search against a DB that I inherited. The requirement states that the user must be able to search for an object by name. Unfortunately, an object could have multiple names associated with it. For example:
ID Name 1 John and Jane Doe 2 Foo McFoo 3 Boo McBoo
It's easy enough to implement a search when a single name exists in each record:
var objects = from x in db.Foo
where x.Name.Contains("Foo McFoo")
select x;
However, when multiple names exist, that approach does not work.
Question: Is it possible to write a search method that would return record one (John and Jane Doe) when someone uses the search terms John Doe
or Jane Doe
?
You either need to pull the names out in to First/LastName columns or in to another table perhaps if there's multiple aliases.
But what I really think you should look at it something like Lucene if you need something 'forgiving' or 'fuzzy'
Question: Is it possible to write a search method that would return record one (John and Jane Doe) when someone uses the search terms John Doe or Jane Doe?
To be very specific to your question, you could convert "John Doe" to LIKE '%John%Doe'
or "Jane Doe" to LIKE '%Jane%Doe'
and this would retrieve that record. However I could see problems with names like "Johnathan Poppadoe".
This would hurt performance, but how about this quick one:
string[] filters = "John Doe".Split(new[] {' '});
var objects = from x in db.Foo
where filters.All(f => x.Name.Contains(f))
select x;
It seems to return what you would expect. Now you would tune it to behave nice when you have also a record "John Doe" as well as "John and Jane Doe".
Does this work for you?
You could create a custom extension method named "ContainsFuzzy":
public static bool ContainsFuzzy(this string target, string text){
// do the cheap stuff first
if ( target == text ) return true;
if ( target.Contains( text ) ) return true;
// if the above don't return true, then do the more expensive stuff
// such as splitting up the string or using a regex
}
Then your LINQ would at least be easier to read:
var objects = from x in db.Foo
where x.Name.ContainsFuzzy("Foo McFoo")
select x;
The obvious disadvantage is that each call to ContainsFuzzy means recreating your split list, etc, so there's some overhead involved. You could create a class called FuzzySearch which at least would give you some increased effeciency:
class FuzzySearch{
private string _searchTerm;
private string[] _searchTerms;
private Regex _searchPattern;
public FuzzySearch( string searchTerm ){
_searchTerm = searchTerm;
_searchTerms = searchTerm.Split( new Char[] { ' ' } );
_searchPattern = new Regex(
"(?i)(?=.*" + String.Join(")(?=.*", _searchTerms) + ")");
}
public bool IsMatch( string value ){
// do the cheap stuff first
if ( _searchTerm == value ) return true;
if ( value.Contains( _searchTerm ) ) return true;
// if the above don't return true, then do the more expensive stuff
if ( _searchPattern.IsMatch( value ) ) return true;
// etc.
}
}
Your LINQ:
FuzzySearch _fuzz = new FuzzySearch( "Foo McFoo" );
var objects = from x in db.Foo
where _fuzz.IsMatch( x.Name )
select x;
I wonder how no one mentioned the Levenshtein distance algorithm.
It's an algorithm the tells the distance between two strings by an int.
Here's a SO post that you can find some implementations of this algorithm.
So with a distance function of signature int Distance(string x, string y)
, you can filter out high distances and order your results so that low distances appear on top of your results, using LINQ.
Note that this is going to be performance costly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With