Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the Most Efficient way to compare large List of integers to smaller List of integers?

At the moment I have a list of 1million integers, and I check each integer against a blacklist of 2000 integers. This is taking about 2 minutes.

for(int i = 0; i< MillionIntegerList.Length ; i++) {     for(int blacklisted = 0; blacklisted < TwoThousandIntegerList.Length ; blacklisted++)         if(i==blacklisted)             i = 0; //Zero is a sentinel value  } 

This makes 2,000,000,000 iterations(loops) altogether. Is there a better way Im not seeing? thanks

like image 712
theUser Avatar asked May 23 '12 12:05

theUser


People also ask

How do you compare integers in two lists in Python?

We can club the Python sort() method with the == operator to compare two lists. Python sort() method is used to sort the input lists with a purpose that if the two input lists are equal, then the elements would reside at the same index positions.

Does order matter compare list Python?

Contrary to lists, sets in Python don't care about order. For example, a set {1, 2, 3} is the same as {2, 3, 1} . As such, we can use this feature to compare the two lists ignoring the elements' order. To do so, we convert each list into a set, then using the == to compare them.


2 Answers

Three options now - the first two are more general, in that they don't rely on MillionIntegerList being sorted (which wasn't originally specified). The third is preferable in the case where the large list is already sorted.

Option 1

Yes, there's definitely a better way of doing it, using LINQ:

var common = MillionIntegerList.Intersect(TwoThousandIntegerList).ToList(); 

That will internally use a HashSet<int> built via the TwoThousandIntegerList, then look up each element of MillionIntegerList within it - which will be much more efficient than going through the whole of TwoThousandIntegerList each time.

If you only want the non-blacklisted ones, you need:

var valid = MillionIntegerList.Except(TwoThousandIntegerList).ToList(); 

Note that if you only need to iterate over the results once, you should remove the ToList call - I've included it to materialize the results so they can be examined multiple times cheaply. If you're just iterating, the return value of Intersect or Except will just stream the results, making it much cheaper in terms of memory usage.

Option 2

If you don't want to rely on the implementation details of LINQ to Objects but you still want a hash-based approach:

var hashSet = new HashSet<int>(TwoThousandIntegerList); hashSet.IntersectWith(MillionIntegerList); // Now use hashSet 

Option 3

The approach of using the fact that the large list is sorted would definitely be useful.

Assuming you don't mind sorting the blacklisted list first as well, you could write a streaming (and general purpose) implementation like this (untested):

// Note: to use this, you'd need to make sure that *both* sequences are sorted. // You could either sort TwoThousandIntegerList in place, or use LINQ's OrderBy // method.  public IEnumerable<T> SortedIntersect<T>(this IEnumerable<T> first,     IEnumerable<T> second) where T : IComparable<T> {     using (var firstIterator = first.GetEnumerator())     {         if (!firstIterator.MoveNext())         {             yield break;         }          using (var secondIterator = second.GetEnumerator())         {             if (!secondIterator.MoveNext())             {                 yield break;             }             T firstValue = firstIterator.Current;             T secondValue = secondIterator.Current;              while (true)             {                 int comparison = firstValue.CompareTo(secondValue);                 if (comparison == 0) // firstValue == secondValue                 {                     yield return firstValue;                 }                 else if (comparison < 0) // firstValue < secondValue                 {                     if (!firstIterator.MoveNext())                     {                         yield break;                     }                     firstValue = firstIterator.Current;                 }                 else // firstValue > secondValue                 {                     if (!secondIterator.MoveNext())                     {                         yield break;                     }                     secondValue = secondIterator.Current;                 }               }                         }     } } 

(You could take an IComparer<T> if you wanted instead of relying on T being comparable.)

like image 79
Jon Skeet Avatar answered Oct 23 '22 11:10

Jon Skeet


Since the large list is sorted. You might get the best results by sorting the small list (very quick) and then doing a linear merge. You'll only need to look at each item in the large (and small) list once, and there will be no need for a Hashtable to be created in the background.

See the merge function part of MergeSort for an idea on how to do this.

like image 38
Daniel Renshaw Avatar answered Oct 23 '22 11:10

Daniel Renshaw