Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding duplicates within list of list

Tags:

c#

algorithm

Simple situation. I have a list of lists, almost table like, and I am trying to find out if any of the lists are duplicated.

Example:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

I would like to know that there are 4 total items, 2 of which are duplicates. I was thinking about doing something like a SQL checksum but I didn't know if there was a better/easier way.

I care about performance, and I care about ordering.

Additional Information That May Help

  • Things inserted into this list will never be removed
  • Not bound to any specific collection.
  • Dont care about function signature
  • They type is not restricted to int
like image 487
Nix Avatar asked Aug 24 '10 19:08

Nix


3 Answers

Let's try to get best performace. if n is number of lists and m is length of lists then we can get O(nm + nlogn + n) plus some probability of hash codes to be equal for different lists.

Major steps:

  1. Calculate hash codes*
  2. Sort them
  3. Go over list to find dupes

* this is important step. for simlicity you can calc hash as = ... ^ (list[i] << i) ^ (list[i + 1] << (i + 1))

Edit for those people that think that PLINQ can boost the thing, but not good algorythm. PLINQ can also be added here, because all the steps are easily parallelizable.

My code:

static public void Main()
{
    List<List<int>> list = new List<List<int>>(){
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
      new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
    };
    var hashList = list.Select((l, ind) =>
    {
        uint hash = 0;
        for (int i = 0; i < l.Count; i++)
        {
            uint el = (uint)l[i];
            hash ^= (el << i) | (el >> (32 - i));
        }
        return new {hash, ind};
    }).OrderBy(l => l.hash).ToList();
    //hashList.Sort();
    uint prevHash = hashList[0].hash;
    int firstInd = 0;            
    for (int i = 1; i <= hashList.Count; i++)
    {
        if (i == hashList.Count || hashList[i].hash != prevHash)
        {
            for (int n = firstInd; n < i; n++)
                for (int m = n + 1; m < i; m++)
                {
                    List<int> x = list[hashList[n].ind];
                    List<int> y = list[hashList[m].ind];
                    if (x.Count == y.Count && x.SequenceEqual(y))
                        Console.WriteLine("Dupes: {0} and {1}", hashList[n].ind, hashList[m].ind);
                }                    
        }
        if (i == hashList.Count)
            break;
        if (hashList[i].hash != prevHash)
        {
            firstInd = i;
            prevHash = hashList[i].hash;
        }
    }
}
like image 96
Andrey Avatar answered Sep 24 '22 03:09

Andrey


Unless you're doing some seriously heavy lifting, perhaps the following simple code will work for you:

var lists = new List<List<int>>()
{
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 4, 2, 4, 5, 6 },
   new List<int>() {0 ,3, 2, 5, 1, 6, 4 }
};

var duplicates = from list in lists
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;

Obviously you could get better performance if you hand-tweak an algorithm such that you don't have to scan the lists each iteration, but there is something to be said for writing declarative, simpler code.

(Plus, thanks to the Awesomeness of LINQ®, by adding a .AsParallel() call to the above code, the algorithm will run on multiple cores, thus running potentially faster than the complex, hand-tweaked solutions mentioned in this thread.)

like image 43
Judah Gabriel Himango Avatar answered Sep 24 '22 03:09

Judah Gabriel Himango


Something like this will give you the correct results:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

list.ToLookup(l => String.Join(",", l.Select(i => i.ToString()).ToArray()))
    .Where(lk => lk.Count() > 1)
    .SelectMany(group => group);
like image 44
theburningmonk Avatar answered Sep 24 '22 03:09

theburningmonk