Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

.net Distinct() and complex conditons

Assume I have a class

public class Audio
{
    public string artist   { get; set; }
    public string title    { get; set; }
    // etc.
}

Now I want to filter duplicates in list of such audio's by similarity (not EXACT match) condition. Basicaly it's Levenstein distance with treshold correction by string total length. The problem is, general tip about IEqualityComparer is "Always implement both GetHashCode and Compare". I obviuosly can't calc distance between strings in GetHashCode because it's not a compare method at all. However in this case even similar audio's will return different hashes and Distinct() will treat it as different objects and compare() method does not fired.

I tried to force GetHashCode always returns 0, so Compare called for each-to-each object in collection, but this is slow. So, finally, a question: is there anything I can do with .net out of the box or should I search some good algorithm for filtering?

like image 361
Tommi Avatar asked Oct 04 '22 22:10

Tommi


1 Answers

I would suggest (first of all) not using Distinct or GetHashCode.

GetHashCode is too strict for your case (as @Gabe pointed out perfectly). What you could do is:

  1. Admit that you will have to compare an entire triangle (O(n^2) complexity) of instances pairs using the Levenshtein
  2. Try to optimise that using every trick in the book: How does calculating the Levenshtein distance from the empty string to the current one sound (that is for each and every instance of Audio and probably for separately for both string properties) ?

That could end up (one might say) with a darn good GetHashCode. But you can't use it like a GetHashCode, you should rather use it like so:

bool AreSimilar(Audio me, Audio you) {
  int cheapLevenshtein = Math.Abs(me.AbsoluteQuasiLevenshtein - you.AbsoluteQuasiLevenshtein);

  if (cheapLevenshtein < THRESHOLD) {

    int expensiveLevenshtein = Audio.LevenshteinBetween(me, you);
    var result = (expensiveLevenshtein < LIMIT);
    return result;

  } else
    return false;
}

And then you'd end up with a better or worse algorithm. This was just an idea and, of course: you can't use Distinct(). If you wish, you can write you very own extension method to make the whole thing look nice from a user programmer's perspective.

And yes the AbsoluteQuasiLevenshtein would be equal for things like: "ab" and "zy" but it would differ greatly between "ab" and "blahblahblahblah" and at least you would optimize things a bit. (The GetHashCode + Distinct approach posed an additional problem - the strictness of GetHashCode).

like image 180
Eduard Dumitru Avatar answered Oct 10 '22 03:10

Eduard Dumitru