Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# fastest intersection of 2 sets of sorted numbers

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.

I've tried a bunch of simple options and am currently using this:

foreach (var index in firstSet)
{
    if (secondSet.BinarySearch(index) < 0)
        continue;

    //do stuff
}

Both firstSet and secondSet are of type List.

I've also tried using LINQ:

var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();

and then looping through intersection.

But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.

Please help me guys as I don't have a lot of time to get this thing done. Thanks.

NOTE: I'm doing this about 5.3 million times. So every microsecond counts.

like image 555
gligoran Avatar asked Aug 23 '11 16:08

gligoran


4 Answers

If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.

Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.

The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.

Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):

(There's more text at the end of the answer!)

static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
    IEnumerable<T> sequence2,
    IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}

EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether or not this is worth the extra implementation cost or not is a different matter, however.

like image 69
Jon Skeet Avatar answered Nov 05 '22 22:11

Jon Skeet


Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).

This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.

var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();

while (firstIndex < firstCount && secondIndex < secondCount)
{
    var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
    if (comp < 0) {
        ++firstIndex;
    }
    else if (comp > 0) {
        ++secondIndex;
    }
    else {
        intersection.Add(firstSet[firstIndex]);
        ++firstIndex;
        ++secondIndex;
    }
}

The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.

like image 27
Jon Avatar answered Nov 05 '22 22:11

Jon


You're using a rather inefficient Linq method for this sort of task, you should opt for Intersect as a starting point.

var intersection = firstSet.Intersect(secondSet);

Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).

like image 5
Anthony Pegram Avatar answered Nov 05 '22 20:11

Anthony Pegram


I was using Jon's approach but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets and needed more performance. The case I was running in to was heavily imbalanced sizes of the lists (eg 5 and 80,000) and wanted to avoid iterating the entire large list.

I found that detecting the imbalance and changing to an alternate algorithm gave me huge benifits over specific data sets:

public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
        List<T> sequence2,
        IComparer<T> comparer)
{
    List<T> smallList = null;
    List<T> largeList = null;

    if (sequence1.Count() < Math.Log(sequence2.Count(), 2))
    {
        smallList = sequence1;
        largeList = sequence2;
    }
    else if (sequence2.Count() < Math.Log(sequence1.Count(), 2))
    {
        smallList = sequence2;
        largeList = sequence1;
    }

    if (smallList != null)
    {
        foreach (var item in smallList)
        {
            if (largeList.BinarySearch(item, comparer) >= 0)
            {
                yield return item;
            }
        }
    }
    else
    {
        //Use Jon's method
    }
}

I am still unsure about the point at which you break even, need to do some more testing

like image 2
Spencer Rose Avatar answered Nov 05 '22 20:11

Spencer Rose