I am working on an application whose bottleneck is list1.Except(list2). From this post: should i use Except or Contains when dealing with HashSet or so in Linq, the complexity of Except is O(m+n) (where m and n are the sizes of the lists). However, my lists are sorted. Can this help?
The first implementation I can think of :
foreach element in list2            (m iterations)
    look for it in list1            (O(log n) via binary search)
    if present
        set it to null              (O(1); actually removing it would be O(n))
    else
        continue
It has complexity O(m·log n), which is very attractive when m is small and n is large (this is exactly the case with my data sets: m is around 50, n is around 1 000 000). However, the fact that it produces nulls has a lot of implications for the functions that consume the result... Is there any way to keep this complexity without having to write nulls (and then keep track of them)?
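For concreteness, here is a minimal sketch of that marking idea (MarkExcept is just a hypothetical name, not my real code; T is constrained to a reference type so that null can serve as the "removed" marker):

using System.Collections.Generic;

public static class SketchHelpers
{
    // Sketch of the approach described above: binary-search each element of the
    // small list2 in the large sorted list1 and overwrite the hits with null.
    // The indices are collected first and nulled afterwards, so every binary
    // search still runs against the intact sorted list.
    public static void MarkExcept<T>(List<T> list1, List<T> list2)
        where T : class, IComparable<T>
    {
        var hits = new List<int>();
        foreach (var item in list2)                 // m iterations
        {
            int index = list1.BinarySearch(item);   // O(log n) per lookup
            if (index >= 0)
                hits.Add(index);
        }
        foreach (var index in hits)
            list1[index] = null;                    // O(1) "removal"; List<T>.RemoveAt would be O(n)
        // list1 now represents list1.Except(list2), with nulls standing in for removed items.
    }
}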
Any help would be greatly appreciated!
using System;
using System.Collections.Generic;

public class Test
{
    public static void Main()
    {
        // listM: 50 multiples of 13; listN: 10,000 multiples of 7 (both sorted).
        var listM = new List<int>();
        var listN = new List<int>();
        for (int i = 0, x = 0; x < 50; i += 13, x++) {
            listM.Add(i);
        }
        for (int i = 0, x = 0; x < 10000; i += 7, x++) {
            listN.Add(i);
        }
        Console.WriteLine(SortedExcept(listM, listN).Count);
    }

    // O(m log n): binary-search each element of m in the sorted list n
    // and keep the ones that are not found.
    public static List<T> SortedExcept<T>(List<T> m, List<T> n) {
        var result = new List<T>();
        foreach (var itm in m) {
            var index = n.BinarySearch(itm);
            if (index < 0) {
                result.Add(itm);
            }
        }
        return result;
    }
}
EDIT: Here's the O(M + N) version as well:
// O(m + n): walk both sorted lists in lockstep, keeping the elements of m
// that never match an element of n.
public static List<T> SortedExcept2<T>(List<T> m, List<T> n) where T : IComparable<T> {
    var result = new List<T>();
    int i = 0, j = 0;
    if (n.Count == 0) {
        result.AddRange(m);
        return result;
    }
    while (i < m.Count) {
        if (m[i].CompareTo(n[j]) < 0) {
            // m[i] is smaller than everything left in n, so it survives.
            result.Add(m[i]);
            i++;
        } else if (m[i].CompareTo(n[j]) > 0) {
            // n[j] can no longer match anything in m; advance n.
            j++;
        } else {
            // Equal: drop m[i].
            i++;
        }
        if (j >= n.Count) {
            // n is exhausted; the rest of m survives.
            for (; i < m.Count; i++) {
                result.Add(m[i]);
            }
            break;
        }
    }
    return result;
}
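As a quick sanity check (separate from the benchmark below), both helpers can be compared against LINQ's Enumerable.Except, e.g. on the listM/listN data built in Main above:

// (Assumes "using System.Linq;" at the top of the file; these lines could be
//  appended at the end of Main, after listM and listN have been filled.)
var expected = listM.Except(listN).ToList();        // LINQ reference result
var viaBinarySearch = SortedExcept(listM, listN);   // O(m log n) helper
var viaMerge = SortedExcept2(listM, listN);         // O(m + n) helper

Console.WriteLine(expected.SequenceEqual(viaBinarySearch)); // True
Console.WriteLine(expected.SequenceEqual(viaMerge));        // True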
In a quick & dirty benchmark (http://ideone.com/Y2oEQD), the O(M + N) version is always faster, even when N is 10 million. BinarySearch pays a penalty because it accesses array memory in a non-linear fashion, which causes cache misses; the larger N gets, the more memory-access penalties BinarySearch incurs.
If both lists are sorted, then you can implement your own solution easily:
The listA-except-listB algorithm then works as follows:
1. Start from the beginning of both lists.
2. If the listA element is smaller than the listB element, include the listA element in the output and advance listA.
3. If the listB element is smaller than the listA element, advance listB.
4. If the listA and listB elements are equal, advance both lists and do not push the element to the output.
Repeat until listA is exhausted. Take special care that listB might be exhausted before listA; a sketch of these steps is given below.
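Here is one way that might look in C#, assuming the elements implement IComparable<T> (SortedExceptMerge is a hypothetical name; this is essentially the same idea as SortedExcept2 above, except that step 4 advances both lists):

using System;
using System.Collections.Generic;

public static class SortedListExtensions
{
    // Sketch of the four steps above; assumes both inputs are sorted ascending.
    public static List<T> SortedExceptMerge<T>(List<T> listA, List<T> listB)
        where T : IComparable<T>
    {
        var output = new List<T>();
        int a = 0, b = 0;
        while (a < listA.Count)
        {
            if (b >= listB.Count)                      // listB exhausted first:
            {
                output.Add(listA[a]);                  // everything left in listA survives
                a++;
            }
            else
            {
                int cmp = listA[a].CompareTo(listB[b]);
                if (cmp < 0)      { output.Add(listA[a]); a++; }  // step 2
                else if (cmp > 0) { b++; }                        // step 3
                else              { a++; b++; }                   // step 4
            }
        }
        return output;
    }
}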