Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

intersect and any or contains and any. Which is more efficient to find at least one common element?

Tags:

c#

linq

If I have two list and I want to know if there are at least one common element, I have this two options:

lst1.Intersect(lst2).Any();

Lst1.Any(x => lst2.Contains(x));

The two options give me the result that I expect, however I don't know what is the best option. Which is more efficient? And why?

Thanks.

EDIT: when I created this post, apart of the solution, I was looking the reason. I know that I can run tests, but I wouldn't know the reason of the result. One is faster than the other? Is always one solution best than the other?

So for this reason, I hace accepted the answer of Matthew, not only for the test code, but also he explain when one is better than other and why. I appreciate a lot the contributions of Nicholas and Oren too.

Thanks.

like image 810
Álvaro García Avatar asked Jun 17 '13 17:06

Álvaro García


3 Answers

Oren's answer has an error in the way the stopwatch is being used. It isn't being reset at the end of the loop after the time taken by Any() has been measured.

Note how it goes back to the start of the loop with the stopwatch never being Reset() so that the time that is added to intersect includes the time taken by Any().

Following is a corrected version.

A release build run outside any debugger gives this result on my PC:

Intersect:    1ms
Any:       6743ms

Note how I'm making two non-overlapping string lists for this test. Also note that this is a worst-case test.

Where there are many intersections (or intersections that happen to occur near the start of the data) then Oren is quite correct to say that Any() should be faster.

If the real data usually contains intersections then it's likely that it is better to use Any(). Otherwise, use Intersect(). It's very data dependent.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            double intersect = 0;
            double any = 0;

            Stopwatch stopWatch = new Stopwatch();

            List<string> L1 = Enumerable.Range(0, 10000).Select(x => x.ToString()).ToList();
            List<string> L2 = Enumerable.Range(10000, 10000).Select(x => x.ToString()).ToList();

            for (int i = 0; i < 10; i++)
            {
                stopWatch.Restart();
                Intersect(L1, L2);
                stopWatch.Stop();
                intersect += stopWatch.ElapsedMilliseconds;

                stopWatch.Restart();
                Any(L1, L2);
                stopWatch.Stop();
                any += stopWatch.ElapsedMilliseconds;
            }

            Console.WriteLine("Intersect: " + intersect + "ms");
            Console.WriteLine("Any: " + any + "ms");
        }

        private static bool Any(List<string> lst1, List<string> lst2)
        {
            return lst1.Any(lst2.Contains);
        }

        private static bool Intersect(List<string> lst1, List<string> lst2)
        {
            return lst1.Intersect(lst2).Any();
        }            

        static void Main()
        {
            new Program().run();
        }
    }
}

For comparative purposes, I wrote my own test comparing int sequences:

intersect took 00:00:00.0065928
any took       00:00:08.6706195

The code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            var lst1 = Enumerable.Range(0, 10000);
            var lst2 = Enumerable.Range(10000, 10000);

            int count = 10;

            DemoUtil.Time(() => lst1.Intersect(lst2).Any(), "intersect", count);
            DemoUtil.Time(() => lst1.Any(lst2.Contains),     "any",      count);
        }

        static void Main()
        {
            new Program().run();
        }
    }

    static class DemoUtil
    {
        public static void Print(this object self)
        {
            Console.WriteLine(self);
        }

        public static void Print(this string self)
        {
            Console.WriteLine(self);
        }

        public static void Print<T>(this IEnumerable<T> self)
        {
            foreach (var item in self)
                Console.WriteLine(item);
        }

        public static void Time(Action action, string title, int count)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            (title + " took " + sw.Elapsed).Print();
        }
    }
}

If I also time this for overlapping ranges by changing the lists to this and upping the count to 10000:

var lst1 = Enumerable.Range(10000, 10000);
var lst2 = Enumerable.Range(10000, 10000);

I get these results:

intersect took 00:00:03.2607476
any took 00:00:00.0019170

In this case Any() is clearly much faster.

Conclusion

The worst-case performance is very bad for Any() but acceptible for Intersect(). The best-case performance is extremely good for Any() and bad for Intersect(). (and best-case for Any() is probably worst-case for Intersect()!)

The Any() approach is O(N^2) in the worst case and O(1) in the best case. The Intersect() approach is always O(N) (since it uses hashing, not sorting, otherwise it would be O(N(Log(N))).

You must also consider the memory usage: the Intersect() method needs to take a copy of one of the inputs, whereas Any() doesn't.

Therefore to make the best decision you really need to know the characteristics of the real data, and actually perform tests.

If you really don't want the Any() to turn into an O(N^2) in the worst case, then you should use Intersect(). However, the chances are that you will be best off using Any().

And of course, most of the time none of this matters!

Unless you've discovered this part of the code to be a bottleneck, this is of merely academic interest. You shouldn't waste your time with this kind of analysis if there's no problem. :)

like image 191
Matthew Watson Avatar answered Nov 08 '22 22:11

Matthew Watson


It depends on the implementation of your IEnumerables.

Your first try (Intersect/Any), finds all the matches and then determines if the set is empty or not. From the documentation, this looks to be something like O(n) operation:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates [the] second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

Your second try ( Any/Contains ) enumerates over the first collection, an O(n) operation, and for each item in the first collection, enumerates over the second, another O(n) operation, to see if a matching element is found. This makes it something like an O(n2) operation, does it not? Which do you think might be faster?

One thing to consider, though, is that the Contains() lookup for certain collection or set types (e.g., dictionaries, binary trees or ordered collections that allow a binary search or hashtable lookup) might be a cheap operation if the Contains() implementation is smart enough to take advantage of the semantics of the collection upon which it is operating.

But you'll need to experiment with your collection types to find out which works better.

like image 44
Nicholas Carey Avatar answered Nov 08 '22 21:11

Nicholas Carey


See Matthew's answer for a complete and accurate breakdown.

Relatively easy to mock up and try yourself:

        bool found;

        double intersect = 0;
        double any = 0;

        for (int i = 0; i < 100; i++)
        {
            List<string> L1 = GenerateNumberStrings(200000);
            List<string> L2 = GenerateNumberStrings(60000);
            Stopwatch stopWatch = new Stopwatch();

            stopWatch.Start();
            found = Intersect(L1, L2);
            stopWatch.Stop();
            intersect += stopWatch.ElapsedMilliseconds;

            stopWatch.Reset();

            stopWatch.Start();
            found = Any(L1, L2);
            stopWatch.Stop();
            any += stopWatch.ElapsedMilliseconds;
        }

        Console.WriteLine("Intersect: " + intersect + "ms");
        Console.WriteLine("Any: " + any + "ms");
    }

    private static bool Any(List<string> lst1, List<string> lst2)
    {
        return lst1.Any(x => lst2.Contains(x));
    }

    private static bool Intersect(List<string> lst1, List<string> lst2)
    {
        return lst1.Intersect(lst2).Any();
    }

You'll find that the Any method is significantly faster in the long run, likely because it does not require the memory allocations and setup that intersect requires (Any stops and returns true as soon as it finds a match whereas Intersect actually needs to store the matches in a new List<T>).

like image 1
Oren Avatar answered Nov 08 '22 23:11

Oren