Finding duplicates within list of list

Question

Simple situation. I have a list of lists, almost table like, and I am trying to find out if any of the lists are duplicated.

Example:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

I would like to know that there are 4 total items, 2 of which are duplicates. I was thinking about doing something like a SQL checksum but I didn't know if there was a better/easier way.

I care about performance, and I care about ordering.

Additional Information That May Help

Things inserted into this list will never be removed
Not bound to any specific collection.
Dont care about function signature
They type is not restricted to int

Andrey · Accepted Answer

Let's try to get best performace. if n is number of lists and m is length of lists then we can get O(nm + nlogn + n) plus some probability of hash codes to be equal for different lists.

Major steps:

Calculate hash codes*
Sort them
Go over list to find dupes

* this is important step. for simlicity you can calc hash as = ... ^ (list[i] << i) ^ (list[i + 1] << (i + 1))

Edit for those people that think that PLINQ can boost the thing, but not good algorythm. PLINQ can also be added here, because all the steps are easily parallelizable.

My code:

static public void Main()
{
    List<List<int>> list = new List<List<int>>(){
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
      new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
    };
    var hashList = list.Select((l, ind) =>
    {
        uint hash = 0;
        for (int i = 0; i < l.Count; i++)
        {
            uint el = (uint)l[i];
            hash ^= (el << i) | (el >> (32 - i));
        }
        return new {hash, ind};
    }).OrderBy(l => l.hash).ToList();
    //hashList.Sort();
    uint prevHash = hashList[0].hash;
    int firstInd = 0;            
    for (int i = 1; i <= hashList.Count; i++)
    {
        if (i == hashList.Count || hashList[i].hash != prevHash)
        {
            for (int n = firstInd; n < i; n++)
                for (int m = n + 1; m < i; m++)
                {
                    List<int> x = list[hashList[n].ind];
                    List<int> y = list[hashList[m].ind];
                    if (x.Count == y.Count && x.SequenceEqual(y))
                        Console.WriteLine("Dupes: {0} and {1}", hashList[n].ind, hashList[m].ind);
                }                    
        }
        if (i == hashList.Count)
            break;
        if (hashList[i].hash != prevHash)
        {
            firstInd = i;
            prevHash = hashList[i].hash;
        }
    }
}

Judah Gabriel Himango · Answer

Unless you're doing some seriously heavy lifting, perhaps the following simple code will work for you:

var lists = new List<List<int>>()
{
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 4, 2, 4, 5, 6 },
   new List<int>() {0 ,3, 2, 5, 1, 6, 4 }
};

var duplicates = from list in lists
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;

Obviously you could get better performance if you hand-tweak an algorithm such that you don't have to scan the lists each iteration, but there is something to be said for writing declarative, simpler code.

(Plus, thanks to the Awesomeness of LINQ®, by adding a .AsParallel() call to the above code, the algorithm will run on multiple cores, thus running potentially faster than the complex, hand-tweaked solutions mentioned in this thread.)

theburningmonk · Answer

Something like this will give you the correct results:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

list.ToLookup(l => String.Join(",", l.Select(i => i.ToString()).ToArray()))
    .Where(lk => lk.Count() > 1)
    .SelectMany(group => group);

Finding duplicates within list of list

Tags:

c#

algorithm

Nix

3 Answers

Andrey

Judah Gabriel Himango

theburningmonk

Recent Activity

Donate For Us

Finding duplicates within list of list

Tags:

c#

algorithm

Nix

3 Answers

Andrey

Judah Gabriel Himango

theburningmonk

Related questions

Recent Activity

Donate For Us