Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of built-in .NET collection sorters

There was a question asked about how to sort a List. There were several methods given from the basic List.Sort() to List.OrderBy(). The most laughable was a roll-your-own-SelectionSort. I promptly voted that down, but it made me think; wouldn't Linq's OrderBy(), applied to a list, do the same thing? myList.OrderBy(x=>x.Property).ToList() would produce an iterator that basically finds the minimum value of the projection in what's left of the collection and yield returns it. When going through the entire list, that's a selection sort.

Which made me think; what algorithms do the built-in sorters for Lists, SortedLists, Enumerables, etc. use, and by extension, should any of them be avoided for large collections? A SortedList, as it stays sorted by key, would probably use a single-pass InsertionSort on each add; find the first index with a value greater than the new one, and insert before it. Lists and Arrays probably MergeSort themselves pretty efficiently, but I don't know the actual algorithm behind Sort(). We've discussed OrderBy.

What I know above would seem to indicate that List.Sort() or Array.Sort() are the best options for a list of known size, and using Linq to sort an in-memory list or array should be discouraged. For a stream, there really isn't any other way then to OrderBy() the enumerable; the performance loss is mitigated by the fact that you can keep the data as a stream instead of having to have it all before sorting it.

EDIT:

The general consensus is that Sort() is faster given a concrete implementation of a List or Array. OrderBy is reasonable but slower because it adds O(N) complexity of extracting an array from the passed enumerable. SortedList initialization ends up being O(N^2) because of what's under the hood. Moral of the story, use List.Sort() instead of List.OrderBy() when you have an actual List.

like image 365
KeithS Avatar asked Sep 17 '10 20:09

KeithS


2 Answers

Enumerable.OrderBy() slurps the IEnumerable<> into an array and uses quick sort. O(n) storage requirements. It's done by an internal class in System.Core.dll, EnumerableSort<TElement>.QuickSort(). The storage cost makes it uncompetitive with simply sorting the list, if you have one, since List<> sorts in-place. Linq often optimizes by checking the true capabilities of the IEnumerable with the is operator. Won't work here since List<>.Sort is destructive.

List<>.Sort and Array.Sort use in-place quick sort.

SortedList<> has O(n) complexity for an insertion, dominating the O(log(n)) complexity of finding the insertion point. So putting N unsorted items into it will cost O(n^2). SortedDictionary<> uses a red-black tree, giving insert O(log(n)) complexity. Thus O(nlog(n)) to fill it, same as amortized quick sort.

like image 151
Hans Passant Avatar answered Oct 30 '22 07:10

Hans Passant


A quick gander through reflector tells me that List Sort methods utilize quicksort http://en.wikipedia.org/wiki/Quicksort through System.Collections.Generic.GenericArraySortHelper

SortedList uses Array.BinarySearch to figure out where to insert stuff on each Add

Enumerators don't have sorting logic

Quicksort is a good sorting choice for most situations though it can approach O(n^2) if you're really unlucky with the input data.

If you suspect your input data to be a huge pile of data in an unlucky (already sorted) order for quicksort a trick is to randomize the data first (which is always cheap) and then do the sorting on the randomized data. There are a few tricks the quicksort algorithm can implement to mitigate the problem of sorting already sorted (or nearly sorted) input data, I don't know whether the BCL implementation does any of these.

like image 33
AndreasKnudsen Avatar answered Oct 30 '22 08:10

AndreasKnudsen