I have got two generic lists with 20,000 and 30,000 objects in each list.
class Employee
{
string name;
double salary;
}
List<Employee> newEmployeeList = List<Employee>() {....} // contains 20,000 objects
List<Employee> oldEmployeeList = List<Employee>() {....} // contains 30,000 objects
Lists can also be sorted by name if it improves the speed.
I want to compare these two lists to find out
What is the fastest way to compare such large data lists with above conditions?
I would sort both newEmployeeList
and oldEmployeeList
lists by name
- O(n*log(n))
. And then you can use linear algorithm to search for matches. So the total would be O(n+n*log(n))
if both lists are about the same size. This should be faster than O(n^2)
"brute force" algorithm.
I'd probably recommend the two lists be stored in a Dictionary<string, Employee>
based on the name to begin with, then you can iterate over the keys in one and lookup to see if they exist and the salaries match in the other. This would also save the cost of sorting them later or putting them in a more efficient structure.
This is pretty much O(n) - linear to build both dictionaries, linear to go through the keys and lookup in the other. Since O(n + m + n) reduces to O(n)
But, if you must use List<T>
to hold the lists for other reasons, you could also use the Join()
LINQ method, and build a new list with a Match
field that tells you whether they were a match or mismatch...
var results = newEmpList.Join(
oldEmpList,
n => n.Name,
o => o.Name,
(n, o) => new
{
Name = n.Name,
Salary = n.Salary,
Match = o.Salary == n.Salary
});
You can then filter this with a Where()
clause for Match
or !Match
.
Update: I assume (by the title of your question) that the 2 lists are already sorted. Perhaps they're stored in a database with a clustered index or something. This answer, therefore, relies on that assumption.
Here is an implementation that has O(n)
complexity, and is also very fast, AND is pretty simple too.
I believe this is a variant of the Merge Algorithm.
Here's the idea:
Since both lists are known to be sorted, this will work very well. This implementation assumes that name
is unique in each list.
var comparer = StringComparer.OrdinalIgnoreCase;
var namesAndSalaries = new List<Tuple<Employee, Employee>>();
var namesOnly = new List<Tuple<Employee, Employee>>();
// Create 2 iterators; one for old, one for new:
using (IEnumerator<Employee> A = oldEmployeeList.GetEnumerator()) {
using (IEnumerator<Employee> B = newEmployeeList.GetEnumerator()) {
// Start enumerating both:
if (A.MoveNext() && B.MoveNext()) {
while (true) {
int compared = comparer.Compare(A.Current.name, B.Current.name);
if (compared == 0) {
// Names match
if (A.Current.salary == B.Current.salary) {
namesAndSalaries.Add(Tuple.Create(A.Current, B.Current));
} else {
namesOnly.Add(Tuple.Create(A.Current, B.Current));
}
if (!A.MoveNext() || !B.MoveNext()) break;
} else if (compared == -1) {
// Keep searching A
if (!A.MoveNext()) break;
} else {
// Keep searching B
if (!B.MoveNext()) break;
}
}
}
}
}
One of fastest possible solutions on sorted lists is use of BinarySearch in order to find an item in another list.
But as mantioned others, you should measure it against your project requirements, as performance often tends to be a subjective thing.
You could create a Dictionary using
var lookupDictionary = list1.ToDictionary(x=>x.name);
That would give you close to O(1) lookup and a close to O(n) behavior if you're looking up values from a loop over the other list.
(I'm assuming here that ToDictionary is O(n) which would make sense with a straight forward implementation, but I have not tested this to be the case)
This would make for a very straight forward algorithm, and I'm thinking going below O(n) with two unsorted lists is pretty hard.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With