I have a big list of strings (about 5k-20k entries) that I need to sort and remove duplicates from.
I've done this in two ways now, once with a HashSet and once purely with LINQ. Tests at that entry count did not show a big difference, but I'm wondering which approach, and thus which method, is better suited.
For the two approaches (myList is of type List&lt;string&gt;):
LINQ: I'm using one LINQ statement to order the list and get the distinct values from it.
myList = myList.OrderBy(q => q).Distinct().ToList();
HashSet: I'm using a HashSet to remove all duplicates and then I'm ordering the list.
myList = new HashSet<string>(myList).ToList();
myList = myList.OrderBy(q => q).ToList();
Like I said, the tests I made showed about the same time consumption for both methods, but I'm still wondering if one method is better than the other and, if so, why (the code is for a high-performance part and I need to get every millisecond out of it that I can).
C#'s LINQ Distinct() method removes the duplicate elements from a sequence and returns the distinct elements from a single data source. It falls under the set operators category of the LINQ query operators, and it works much like the DISTINCT keyword in SQL.
Note: a HashSet<T> is a collection of distinct values.
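As a minimal, self-contained sketch of both approaches (the sample strings are made up):

using System;
using System.Collections.Generic;
using System.Linq;

class DistinctDemo
{
    static void Main()
    {
        // Hypothetical input with duplicates.
        var words = new List<string> { "pear", "apple", "pear", "banana", "apple" };

        // LINQ: Distinct() drops duplicates, then OrderBy() sorts the survivors.
        var viaLinq = words.Distinct().OrderBy(q => q).ToList();

        // HashSet<T>: the constructor drops duplicates; sort afterwards.
        var viaHashSet = new HashSet<string>(words).OrderBy(q => q).ToList();

        Console.WriteLine(string.Join(", ", viaLinq));    // apple, banana, pear
        Console.WriteLine(string.Join(", ", viaHashSet)); // apple, banana, pear
    }
}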
If you're really concerned about every nanosecond, then
myList = myList.Distinct().OrderBy(q => q).ToList();
might be slightly faster than:
myList = myList.OrderBy(q => q).Distinct().ToList();
if there are a large number of duplicates, since the sort then only has to handle the distinct values rather than the whole list.
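If you want to verify this on your own data, a rough Stopwatch micro-benchmark is easy to put together (the workload below is made up, and for serious measurements something like BenchmarkDotNet would be more reliable):

using System;
using System.Diagnostics;
using System.Linq;

class DedupBenchmark
{
    static void Main()
    {
        // Synthetic workload in the question's size range, with heavy duplication (assumed).
        var rng = new Random(42);
        var data = Enumerable.Range(0, 20000)
                             .Select(_ => "item" + rng.Next(0, 500))
                             .ToList();

        var sw = Stopwatch.StartNew();
        var dedupeFirst = data.Distinct().OrderBy(q => q).ToList(); // sorts only the ~500 distinct values
        sw.Stop();
        Console.WriteLine($"Distinct then OrderBy: {sw.Elapsed.TotalMilliseconds:F3} ms");

        sw.Restart();
        var sortFirst = data.OrderBy(q => q).Distinct().ToList();   // sorts all 20,000 entries, then dedupes
        sw.Stop();
        Console.WriteLine($"OrderBy then Distinct: {sw.Elapsed.TotalMilliseconds:F3} ms");
    }
}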
The LINQ method is more readable and will have similar performance to explicitly creating a HashSet<T>, as others have said. In fact, it may be slightly faster if the original list is already sorted, since the LINQ method preserves the initial order before sorting, while explicitly creating a HashSet<T> will enumerate in an undefined order.
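A quick sketch of that ordering difference (toy data; note that HashSet<T>'s enumeration order is formally undefined even when it happens to look insertion-ordered):

using System;
using System.Collections.Generic;
using System.Linq;

class OrderDemo
{
    static void Main()
    {
        var sorted = new List<string> { "a", "b", "b", "c", "c", "d" };

        // Distinct() yields elements in first-occurrence order, so an
        // already-sorted input reaches OrderBy() already in order.
        Console.WriteLine(string.Join(", ", sorted.Distinct())); // a, b, c, d

        // HashSet<T> makes no ordering guarantee; whatever order comes
        // out here is an implementation detail you must not rely on.
        Console.WriteLine(string.Join(", ", new HashSet<string>(sorted)));
    }
}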