Assume I have a class
public class Audio
{
public string artist { get; set; }
public string title { get; set; }
// etc.
}
Now I want to filter duplicates in list of such audio's by similarity (not EXACT match) condition. Basicaly it's Levenstein distance with treshold correction by string total length. The problem is, general tip about IEqualityComparer is "Always implement both GetHashCode and Compare". I obviuosly can't calc distance between strings in GetHashCode because it's not a compare method at all. However in this case even similar audio's will return different hashes and Distinct() will treat it as different objects and compare() method does not fired.
I tried to force GetHashCode always returns 0, so Compare called for each-to-each object in collection, but this is slow. So, finally, a question: is there anything I can do with .net out of the box or should I search some good algorithm for filtering?
I would suggest (first of all) not using Distinct or GetHashCode.
GetHashCode is too strict for your case (as @Gabe pointed out perfectly). What you could do is:
That could end up (one might say) with a darn good GetHashCode. But you can't use it like a GetHashCode, you should rather use it like so:
bool AreSimilar(Audio me, Audio you) {
int cheapLevenshtein = Math.Abs(me.AbsoluteQuasiLevenshtein - you.AbsoluteQuasiLevenshtein);
if (cheapLevenshtein < THRESHOLD) {
int expensiveLevenshtein = Audio.LevenshteinBetween(me, you);
var result = (expensiveLevenshtein < LIMIT);
return result;
} else
return false;
}
And then you'd end up with a better or worse algorithm. This was just an idea and, of course: you can't use Distinct(). If you wish, you can write you very own extension method to make the whole thing look nice from a user programmer's perspective.
And yes the AbsoluteQuasiLevenshtein would be equal for things like: "ab" and "zy" but it would differ greatly between "ab" and "blahblahblahblah" and at least you would optimize things a bit. (The GetHashCode + Distinct approach posed an additional problem - the strictness of GetHashCode).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With