 

Algorithm for matching lists of integers

For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulate the following:

DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers

This is probably not relevant, but the list values contains distinct integers with the property that, for a given value of AsOfDate, the union of values over all values of key is itself a set of distinct integers. That is, no integer appears in two different values lists on the same day.

The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.

Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.

We are using the following algorithm. Convert the list values to a string via

string signature = String.Join("|", values.OrderBy(n => n).ToArray());

then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches and then check to see if the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
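
For concreteness, a rough sketch of that matching step (the Entry class and the day1/day2 collections are illustrative names, not part of our code, and it keys a dictionary by the signature string directly rather than by a separate hash):

// Illustrative sketch only; requires using System; using System.Collections.Generic; using System.Linq;
class Entry
{
    public DateTime AsOfDate;
    public int key;
    public List<int> values;
}

// Within a single day no integer appears in two lists, so signatures are unique per day.
Dictionary<string, Entry> day1BySignature = day1.ToDictionary(
    e => String.Join("|", e.values.OrderBy(n => n).ToArray()));

foreach (Entry e2 in day2)
{
    string signature = String.Join("|", e2.values.OrderBy(n => n).ToArray());
    Entry e1;
    if (day1BySignature.TryGetValue(signature, out e1) && e1.key != e2.key)
    {
        // Same set of integers on adjacent days, but under different keys.
    }
}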

Is there a better method?

asked Dec 23 '22 by jason


2 Answers

You could probably just hash the list itself, instead of going through String.
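
For example, the sorted elements can be folded into a hash code directly; a minimal sketch (the combining constants are an arbitrary but common choice):

// Sketch: hash the sorted values without building a string.
// Requires using System.Collections.Generic; using System.Linq;
static int SignatureHash(List<int> values)
{
    unchecked
    {
        int hash = 17;
        foreach (int v in values.OrderBy(n => n))
            hash = hash * 31 + v;   // standard multiply-and-add combining
        return hash;
    }
}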

Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)

You can do this in O(n + m) if you use a bucket array (essentially: a hashtable) into which you plug the hashes. You can compare the two bucket arrays in O(max(n, m)), assuming a length dependent on the number of entries (to get a reasonable load factor).

It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
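
One possible shape for that, as a sketch: materialize each entry's values as a HashSet<int>, give the outer set an order-insensitive comparer, and intersect. The day1Values and day2Values names are assumptions, not from the question.

// Requires using System.Collections.Generic; using System.Linq;
// Order-insensitive equality for sets of ints; GetHashCode is a cheap XOR,
// and Equals does the exact comparison, so hash collisions are handled correctly.
class IntSetComparer : IEqualityComparer<HashSet<int>>
{
    public bool Equals(HashSet<int> a, HashSet<int> b) { return a.SetEquals(b); }
    public int GetHashCode(HashSet<int> s)
    {
        int h = 0;
        foreach (int v in s) h ^= v;
        return h;
    }
}

// day1Values / day2Values: the value lists for each day (assumed names).
var day1Sets = new HashSet<HashSet<int>>(
    day1Values.Select(v => new HashSet<int>(v)), new IntSetComparer());
day1Sets.IntersectWith(day2Values.Select(v => new HashSet<int>(v)));
// day1Sets now holds only the value-sets present on both days; the keys
// they belong to still have to be looked up and compared separately.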

You cannot do better than O(n + m), because every entry needs to be visited at least once.

Edit: misread, fixed.

answered Dec 25 '22 by Thomas


On top of the other answers, you could make the process faster by creating a low-cost hash constructed simply by XORing all the elements of each List. You wouldn't have to order your list, and you would end up with an int, which is easier and faster to store than a string.

Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it. If there is already an existing key, only then do you sort the corresponding Lists and compare them.

You still need to compare them if you find a match because there may be some collisions using a simple XOR.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.

If you were to have your own implementation of List<>, you could build the generation of the XOR key into it, so that it is updated on each operation on the List.
This would make the process of checking for duplicate lists even faster.
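
A rough sketch of what such a wrapper could look like (hypothetical; it is not part of the code below):

// Hypothetical wrapper that keeps the XOR key up to date incrementally,
// so it never has to be recomputed from scratch.
class XorKeyedList
{
    private readonly List<int> items = new List<int>();
    public int XorKey { get; private set; }

    public void Add(int value)
    {
        items.Add(value);
        XorKey ^= value;    // fold the new element into the key
    }

    public bool Remove(int value)
    {
        if (!items.Remove(value)) return false;
        XorKey ^= value;    // XOR is its own inverse, so this undoes the Add
        return true;
    }
}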

Code

Below is a first attempt at implementing this.

Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList) {
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    checkHash.TryGetValue(xorkey, out existingLists);
    if (existingLists != null) {
        // Already in the dictionary. Check each stored list
        foreach (List<int> li in existingLists) {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical) {
                // Check all elements
                foreach (int v in theList) {
                    if (!li.Contains(v)) {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    }
    if (!isIdentical) {
        // Not a duplicate: remember this list under its XOR key
        if (existingLists == null) {
            // first list seen with this key
            existingLists = new List<List<int>>();
            checkHash.Add(xorkey, existingLists);
        }
        existingLists.Add(theList);
    }
    return isIdentical;
}
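
A hypothetical way to use it across two days (day1 and day2 are illustrative names): seed the dictionary with the first day's lists, then call it for the second day's lists; a return value of true means the same set of values already existed, and the keys still have to be compared separately.

// Illustrative usage only; `day1` and `day2` are assumed collections of entries.
foreach (var e in day1) CheckDuplicate(e.values);       // seed with day 1
foreach (var e in day2)
{
    if (CheckDuplicate(e.values))
    {
        // Same set of values as a previously seen list; compare keys separately.
    }
}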

It's not the most elegant or easiest to read at first sight; it's rather 'hacky', and I'm not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing Lists of List<int> in the Dictionary.

If a duplicate key is found, we loop through each previously stored List until we find a match or run out of candidates.

The good point about this code is that it should probably be about as fast as you can get in most cases, and still faster than building and comparing strings when there is a collision.

answered Dec 25 '22 by Renaud Bompuis