Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

I need to compare two very large collections with potentially missing elements

Tags:

c#

linq

I've come across a brick wall in a program I am writing for my work. You don't need to know the context specifically, but long story short, I have two collections of around ~650k records each.

Let's assume that collection A is the one I know is correct, and collection B is the one I know is incorrect.

Collection B contains a complex object, which has a Property of the same type as the elements in Collection A (in other words, it looks like a bit like this):

// Where T : IComparable
IEnumerable<DateTime> A = ...; // Collection of T elements
IEnumerable<Complex> B = ...; // Collection of complex elements.
class Complex<DateTime>
{
   public DateTime Time { get; set; }
   .....
}

My issue is that I basically need to sequentially enumerate over A and see if the current element of A exists in a Complex object in B; if it doesn't exist, then I need to create a Complex object which will encapsulate that element (amongst other things).

The problem occurs when I realize that both lists are 650,000 elements long, approx. I can't reduce the data set down; I have to use these 650,000. Right now I've used ICollection.Contains(), and I tried a (naive) implementation of Binary Search, but it just takes far too long.

Have you got any suggestions for me?

EDIT: If it helps, T implements IComparable. EDIT2: Some more context: The IEnumerable is retrieved from a DataTable using Linq To Objects.

        IEnumerable<Complex> set = set.Tbl
            .Where(dbObject => dbObject.TS.CompareTo(default(DateTime)) != 0)
            .Select(Func<DataRow,Complex>) // Function that wraps the DataRow in a Complex object
            // Just done to make debugging a little easier so we still have a large sample but small enough that it doesn't make me grow a beard
            .Take(100000) 
            .AsEnumerable<Complex>();

For sake of completeness in case this question gets archived and anyone else needs to sovle this issue, my current implementation looked a bit like this

        BDataSet bSet = new BDataSet();
        B_LUTableAdapter adap = new B_LUTableAdapter();
        adap.Fill(bSet.B_LU);
        IEnumerable<Complex> w3 = bSet.B
            .Where(dbObject => dbObject.TS.CompareTo(default(DateTime)) != 0)
            // Function that just wraps datarow into a complex object
            .Select(Func<DataRow, Complex>)
            // Just for sake of debugging speed
            .Take(100000)
            .AsEnumerable<Complex>();

        List<Complex> b = bSet.OrderBy(x => x.Time).ToList<Complex>();
        // Get last & first timestamps
        // Some of the timestamps in b are 01/01/1011 for some reason,
        // So we do this check.
        Complex start = b.Where(x => x.Time != default(DateTime)).First();
        Complex end = b.Last();

        List<DateTime> a = new List<DateTime>();
        // RoundSeconds reduces seconds in a DateTime to 0.
        DateTime current = RoundSeconds(new DateTime(start.Time.Ticks));

        while (current.CompareTo(RoundSeconds(end.Time)) <= 0)
        {
            a.Add(current);
            current = current.Add(TimeSpan.FromMinutes(1));
        }

        IEnumerable<DateTime> times = b.Select(x => x.Time);
        var missing = a.Where(dt => times.Contains(dt));
        foreach (var dt in missing)
        {
            adap.Insert(dt, 0, "", "", "", null, 0, 0);
            // This has since been changed to List.Add()
        }

Thanks to Cosmin this issue is now resolved, and the finished implementation is this: List expected = new List(); DateTime current = RoundSeconds(new DateTime(start.Time.Ticks));

        while (current.CompareTo(RoundSeconds(end.Time)) <= 0)
        {
            expected.Add(current);
            current = current.Add(TimeSpan.FromMinutes(1));
        }

        Console.WriteLine("Expecting {0} intervals.", expected.Count);
        var missing = b.FindAllMissing(expected, x => x.Time);
        if(!missing.Any()) return;
        Console.WriteLine("{0} missing intervals.", missing.Count());
        foreach (var dt in missing)
        {
            b.Add(new Complex() { /* some values */ });
            //Console.WriteLine("\t> Inserted new record at {0}", dt);
        }

        //.....
        public static IEnumerable<Basic> FindAllMissing<Basic, Complex>(this IEnumerable<Complex> complexList,
        IEnumerable<Basic> basicList,
        Func<Complex, Basic> selector)
        {
            HashSet<Basic> inComplexList = new HashSet<Basic>();
            foreach (Complex c in complexList)
                inComplexList.Add(selector(c));
            List<Basic> missing = new List<Basic>();
            foreach (Basic basic in basicList)
                if (!(inComplexList.Contains(basic)))
                    missing.Add(basic);
            return missing;
        }
like image 719
Dan Avatar asked Oct 11 '13 07:10

Dan


1 Answers

Step-by-step:

  • Use one of the O(1) generic collections to create a fast-searchable list of T's that are already in the second collection. May I suggest HashSet<T>
  • Enumerate over the SECOND collection and put all the T's in the collection from the first step.
  • Enumerate of the FIRST collection and check if each item is in the collection created at step one. Since that operation is O(1) you've now got a O(n) solution.
  • Enjoy.

Here's a class that implements that algorithm as a generic extension method, to make it extra LINQ-friendly. Made take it's arguments as IEnumerable<T> and return IEnumerable<T>, made no assumptions about the types (T and Complex). In my test I'm using a list of Tuple<int,int> as a complex type and a simple int as the simple type. The console application fills the List<Tuple<int,int>> with 600000 values, then puts 100000 values in the simple List<int> that uses an enumerator to count all the simple values that are not found in the List<Tuple<int,int>>; It's so fast you don't get a chance to see it doing it's work, when you hit F5 it just shows the result.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication2
{

    static class FixProblem
    {
        public static IEnumerable<T> FindAllThatNeedCreating<T, Complex>(this IEnumerable<Complex> list_of_complex, IEnumerable<T> list_of_T, Func<Complex, T> extract)
        {
            HashSet<T> T_in_list_of_complex = new HashSet<T>();
            foreach (Complex c in list_of_complex)
                T_in_list_of_complex.Add(extract(c));
            List<T> answer = new List<T>();
            foreach (T t in list_of_T)
                if (!T_in_list_of_complex.Contains(t))
                    answer.Add(t);
            return answer;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Test the code
            List<Tuple<int, int>> complex = new List<Tuple<int, int>>();
            List<int> simple = new List<int>();

            // Fill in some random data
            Random rnd = new Random();
            for (int i = 1; i < 600000; i++)
                complex.Add(new Tuple<int, int>(rnd.Next(), rnd.Next()));

            for (int i = 1; i < 100000; i++)
                simple.Add(rnd.Next());

            // This is the magic line of code:
            Console.WriteLine(complex.FindAllThatNeedCreating(simple, x => x.Item1).Count());

            Console.ReadKey();

        }
    }
}
like image 147
Cosmin Prund Avatar answered Nov 15 '22 00:11

Cosmin Prund