I have a large list of GUIDs and other data that I need to pull a subset from, what is the quickest way to do this?

I have a list of audit data from Dynamics CRM 2013 that I have deserialised and stuck into a HashSet, defined as:

private class AuditCache
{
    public Guid ObjectId;
    public int HistoryId;
    public DateTime? DateFrom;
    public DateTime? DateTo;
    public string Value;
};
private HashSet<AuditCache> _ac = new HashSet<AuditCache>();

I add the data like this (from a SQL Server recordset):

_ac.Add(new AuditCache{ 
    ObjectId = currentObjectId,
    HistoryId = Convert.ToInt32(dr["HistoryId"]),
    DateTo = Convert.ToDateTime(dr["CreatedOn"]),
    Value = value});

I end up with roughly half a million records.

Next I need to iterate through every Guid and pull out the subset of data from my audit data that matches. I have a list of the Guids that I generate elsewhere and there are around 300,000 to process. I store them in this:

var workList = new Dictionary<Guid, DateTime>();

...and iterate through them like this:

foreach (var g in workList)

Then I need to do this to pull out the subset for each Guid:

List<AuditCache> currentSet = _ac.Where(v => v.ObjectId == g.Key).ToList();

But it's slow.

It takes around 1 minute to populate my initial audit data list, but it would take hours to pull out each set, process it and squirt it back into a database table (I never ran it to completion, so this is extrapolated from the time to process 1% of the data).

Stepping through the code, I can see the bottleneck seems to be pulling out the subset from my list for each Guid. So my question is: is there a better/more efficient way (architecture?) to store/retrieve my data set?

One thing to note: I know Guids are inherently slow to index/search, but I am pretty much constrained to using them due to the way Dynamics CRM works. I guess I could create a Dictionary to look up Guids and "convert" them to integer values, or something along those lines, but I am not convinced that would help much.
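For what it's worth, the repeated `Where` scan is O(n) per Guid, so 300,000 lookups over 500,000 records is the real cost. A sketch of the alternative (building a lookup once, then fetching each subset in O(1)), using a cut-down `AuditCache` for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class AuditCache
{
    public Guid ObjectId;
    public string Value;
}

class LookupDemo
{
    static void Main()
    {
        var id = Guid.NewGuid();
        var ac = new List<AuditCache>
        {
            new AuditCache { ObjectId = id, Value = "a" },
            new AuditCache { ObjectId = id, Value = "b" },
            new AuditCache { ObjectId = Guid.NewGuid(), Value = "c" }
        };

        // Build the lookup once: O(n). Every per-Guid fetch after this is O(1).
        ILookup<Guid, AuditCache> byId = ac.ToLookup(v => v.ObjectId);

        // Replaces: _ac.Where(v => v.ObjectId == g.Key).ToList()
        List<AuditCache> currentSet = byId[id].ToList();
        Console.WriteLine(currentSet.Count); // prints 2
    }
}
```

A `Dictionary<Guid, List<AuditCache>>` built by hand behaves the same way; `ToLookup` just does the grouping in one call.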

Edit

Okay, I tested the three solutions using my live data (371,901 Guids) and these are the results as an average time per 1,000 Guids. Note that this includes the processing/ INSERT to SQL Server so it isn't a proper benchmark.

Method #0 - List with Lambda    ~30.00s per 1,000 rows (I never benchmarked this precisely)
Method #1 - IntersectWith        40.24s per 1,000 rows (cloning my Hashset spoilt this)
Method #2 - BinarySearch          3.20s per 1,000 rows
Method #3 - Generic Dictionary    2.19s per 1,000 rows

On the basis of this I am going to probably rewrite my code from scratch as I think the whole approach I was taking was incorrect.

However, this has been a very useful learning exercise and many thanks to everyone who contributed. I am going to accept the BinarySearch as the correct answer as it does what I wanted and is much faster than my original code.

Just to be clear here, the IntersectWith is indeed "smoking" fast, but it doesn't work for my particular problem as I need to constantly go back to my original hashset.

asked Aug 07 '14 by Richard Hansell

2 Answers

1 million in 1.2 seconds on a P4

HashSet has IntersectWith
It is smoking fast

If the collection represented by the other parameter is a HashSet collection with the same equality comparer as the current HashSet object, this method is an O(n) operation. Otherwise, this method is an O(n + m) operation, where n is Count and m is the number of elements in other.

But to make it work you need AuditCache to override both GetHashCode and Equals (see the Object.GetHashCode documentation).

GUIDs hash nicely (minimal collisions), so this will be very fast.

WorkList entries will also need to be AuditCache (even if they are not really audit records).
Or you could have both types derive from a base class that uses Guid ObjectId as the key (for Equals and GetHashCode).

GUID is NOT inherently slow to index if you are using it as a hashed key (dictionary and hashset) - it is a great key as it would have few (or no) collisions. Even if you go with both as Dictionary with key as Guid it would be a lot faster. But Dictionary does not have IntersectWith.
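To illustrate that last point, a minimal sketch of a Guid-keyed Dictionary (the audit payload is simplified to a string here; the real rows would carry the full record):

```csharp
using System;
using System.Collections.Generic;

class GuidDictionaryDemo
{
    static void Main()
    {
        // Group rows by Guid as they are read; each later lookup is O(1).
        var byId = new Dictionary<Guid, List<string>>();
        var id = Guid.NewGuid();

        foreach (var value in new[] { "v1", "v2" })
        {
            if (!byId.TryGetValue(id, out var list))
                byId[id] = list = new List<string>();
            list.Add(value);
        }

        // Per-Guid retrieval without scanning the whole collection.
        Console.WriteLine(byId.TryGetValue(id, out var found) ? found.Count : 0); // prints 2
    }
}
```

No IntersectWith here, as the answer notes, but for "give me the subset for this Guid" a keyed lookup is all that's needed.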

1 million in 1.2 seconds on a P4: 0.5 seconds to clone and 0.7 seconds to intersect.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Windows;

namespace HashSetIntersect
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
            Stopwatch sw = new Stopwatch();
            sw.Start();
            HashSet<AuditCache> TestHashKeys1 = new HashSet<AuditCache>();
            HashSet<AuditCache> TestHashKeys2 = new HashSet<AuditCache>();
            for (UInt32 i = 0; i < 1000000; i++)
            {
                Guid g = Guid.NewGuid();
                TestHashKeys1.Add(new AuditCache(g, 1, (DateTime?)null, (DateTime?)null, "value1"));
                if (i % 2 == 0) TestHashKeys2.Add(new AuditCache(g, 0, (DateTime?)null, (DateTime?)null, "value2"));
            }            
            Debug.WriteLine(TestHashKeys1.Count.ToString() + " " + TestHashKeys2.Count.ToString());
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
            sw.Restart();
            HashSet<AuditCache> TestHashKeys3 = new HashSet<AuditCache>(TestHashKeys1);
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
            sw.Restart();
            TestHashKeys3.IntersectWith(TestHashKeys2);
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
            foreach (AuditCache ac in TestHashKeys3)
            {
                Debug.WriteLine(ac.Value);
            }
        }
    }
    public abstract class HashKey : Object
    {
        public Guid ObjectId { get; private set; }
        public override bool Equals(object obj)
        {
            if (!(obj is HashKey)) return false;
            HashKey comp = (HashKey)obj;
            return this.ObjectId == comp.ObjectId;
        }

        public override int GetHashCode()
        {
            return ObjectId.GetHashCode();
        }
        public HashKey(Guid objectId)
        {
            ObjectId = objectId;
        }
    }
    public class TestHashKey : HashKey
    {
        public TestHashKey(Guid ObjectId)
            : base(ObjectId)
        { }
    }
    public class AuditCache : HashKey
    {
        public int HistoryId { get; private set; }
        public DateTime? DateFrom { get; private set; }
        public DateTime? DateTo { get; private set; }
        public string Value { get; private set; }
        public AuditCache(Guid objectId, int historyId, DateTime? dateFrom, DateTime? dateTo, string value)
            : base(objectId)
        {
            HistoryId = historyId;
            DateFrom = dateFrom;
            DateTo = dateTo;
            Value = value;
        }
    }
}
answered Nov 15 '22 by paparazzo

How about if you sort the AuditCache list by GUID (which is a big integer after all) and then use List<T>.BinarySearch on it?

I got pretty good results for it (under 15 seconds on an i3-3110M @2.4Ghz). Cumulative times below:

  1. Collection created: 892 ms
  2. Collection sorted: 9285 ms
  3. Lookup: 12055 ms

Below I'm using BigInteger from System.Numerics to interpret Guids as 128 bit integers.

If I'm not missing something, then this should work. Note that the lookup for loop is actually the worst-case scenario, because it is very unlikely that there are collisions, so the index will almost always be negative (BinarySearch returns the bitwise complement of the insertion point on a miss). In your case it could be even faster:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Numerics;

class AuditCache
{
    public Guid ObjectId;
    public int HistoryId;
    public DateTime? DateFrom;
    public DateTime? DateTo;
    public string Value;
};

class AuditCacheComparer : IComparer<AuditCache>
{
    public int Compare(AuditCache x, AuditCache y)
    {
        BigInteger intx = new BigInteger(x.ObjectId.ToByteArray());
        BigInteger inty = new BigInteger(y.ObjectId.ToByteArray());
        if (intx < inty)
        {
            return -1;
        }
        else if (intx > inty)
        {
            return 1;
        }

        return 0;
    }
}

class Program
{

    static void Main(string[] args)
    {
        List<AuditCache> testCollection = new List<AuditCache>();
        Stopwatch sw = Stopwatch.StartNew();

        for (int i = 0; i != 1000000; ++i)
        {
            testCollection.Add(new AuditCache() { ObjectId = Guid.NewGuid(), HistoryId = i });
        }

        Console.WriteLine("Collection created: {0} ms", sw.ElapsedMilliseconds);

        AuditCacheComparer comparer = new AuditCacheComparer();
        testCollection.Sort(comparer);

        Console.WriteLine("Collection sorted: {0} ms", sw.ElapsedMilliseconds);

        for(int i = 0; i != 300000; ++ i)
        {
            var index = testCollection.BinarySearch(new AuditCache() {ObjectId = Guid.NewGuid()}, comparer);
            if (index >= 0) // >= 0: index 0 is a valid hit
            {
                Console.WriteLine("Found: {0} ms", sw.ElapsedMilliseconds);
            }
        }

        Console.WriteLine("Lookup: {0} ms", sw.ElapsedMilliseconds);
        Console.ReadLine();
    }
}
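As a side note, `Guid` already implements `IComparable<Guid>`, so a comparer can skip the `BigInteger` allocations entirely. The resulting order differs from the byte-array interpretation above, but binary search only requires that the sort and the search use the same consistent total order. A sketch:

```csharp
using System;
using System.Collections.Generic;

class GuidComparerDemo
{
    // Orders by Guid.CompareTo; any consistent total order works for BinarySearch.
    class ByGuid : IComparer<Guid>
    {
        public int Compare(Guid x, Guid y) => x.CompareTo(y);
    }

    static void Main()
    {
        var ids = new List<Guid> { Guid.NewGuid(), Guid.NewGuid(), Guid.NewGuid() };
        var cmp = new ByGuid();
        ids.Sort(cmp);

        // BinarySearch returns a non-negative index on a hit
        // (test with >= 0, since index 0 is a valid hit).
        int index = ids.BinarySearch(ids[0], cmp);
        Console.WriteLine(index >= 0); // prints True
    }
}
```

The same comparer shape drops into the `AuditCacheComparer` above by comparing `x.ObjectId.CompareTo(y.ObjectId)` directly.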
answered Nov 15 '22 by Marcel N.