Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what is the fastest way to generate a unique set in .net 2

I have what is essentially a jagged array of name value pairs - i need to generate a set of unique name values from this. the jagged array is approx 86,000 x 11 values. It does not matter to me what way I have to store a name value pair (a single string "name=value" or a specialised class for example KeyValuePair).
Additional Info: There are 40 distinct names and a larger number of distinct values - probably in the region 10,000 values.

I am using C# and .NET 2.0 (and the performance is so poor I am thinking that it may be better to push my entire jagged array into a sql database and do a select distinct from there).

Below is the current code Im using:

List<List<KeyValuePair<string,string>>> vehicleList = retriever.GetVehicles();
this.statsLabel.Text = "Unique Vehicles: " + vehicleList.Count;

Dictionary<KeyValuePair<string, string>, int> uniqueProperties = new Dictionary<KeyValuePair<string, string>, int>();
foreach (List<KeyValuePair<string, string>> vehicle in vehicleList)
{
    foreach (KeyValuePair<string, string> property in vehicle)
    {
        if (!uniqueProperties.ContainsKey(property))
        {
            uniqueProperties.Add(property, 0);
        }
    }
}
this.statsLabel.Text += "\rUnique Properties: " + uniqueProperties.Count;
like image 214
dice Avatar asked Oct 24 '08 10:10

dice


1 Answers

I have it running in 0.34 seconds down from 9+ minutes

The problem is when comparing the KeyValuePair structs. I worked around it by writing a comparer object, and passing an instance of it to the Dictionary.

From what I can determine, the KeyValuePair.GetHashCode() returns the hashcode of it's Key object (in this example the least unique object).

As the dictionary adds (and checks existence of) each item, it uses both Equals and GetHashCode functions, but has to rely on the Equals function when the hashcode is less unique.

By providing a more unique GetHashCode function, it excerises the Equals function far less often. I also optimised the Equals function to compare the more unique Values before the less unqiue Keys.

86,000 * 11 items with 10,000 unique properties runs in 0.34 seconds using the comparer object below (without the comparer object it takes 9 minutes 22 seconds)

Hope this helps :)

    class StringPairComparer
        : IEqualityComparer<KeyValuePair<string, string>>
    {
        public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
        {
            return x.Value == y.Value && x.Key == y.Key;
        }
        public int GetHashCode(KeyValuePair<string, string> obj)
        {
            return (obj.Key + obj.Value).GetHashCode();
        }
    }

EDIT: If it was just one string (instead of a KeyValuePair, where string = Name+Value) it would be approx twice as fast. It's a nice intresting problem, and I have spent faaaaaar too much time on it (I learned quiet a bit though)

like image 119
Binary Worrier Avatar answered Nov 14 '22 22:11

Binary Worrier