What is a good way to get the top 10 records from a very large collection and use a custom OrderBy? If I use the LINQ to Objects OrderBy method it is slow and takes a lot of memory because it creates an entire new collection with the new order. I would like a new method with the signature below that does not re-order the entire collection and is very fast:
public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer,
int topCount)
I tried to write it but it got very complicated and I thought there might be any easier way using Aggregate or something. Any help would be appreciated.
Answer
Thanks for the help. I ended up with the code below:
public static List<TSource> OrderByTop<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer,
int topCount)
{
var itemComparer = keySelector.ToIComparer(comparer);
return source.Aggregate(
new List<TSource>(topCount),
(List<TSource> list, TSource item) =>
list.SortedInsert(item, itemComparer, topCount));
}
The List Extension method SortedInsert follows:
public static List<T> SortedInsert<T>(
this List<T> list,
T item,
IComparer<T> comparer,
int maxLength)
{
if (list.Count == maxLength)
if (comparer.Compare(item, list[maxLength - 1]) >= 0)
return list;
else
list.RemoveAt(maxLength - 1);
int insertIndex = list.BinarySearch(item, comparer);
if (insertIndex < 0)
insertIndex = ~insertIndex;
list.Insert(insertIndex, item);
return list;
}
For those interested I also had keySelector Extension method to convert to IComparer.
public static IComparer<TSource> ToIComparer<TSource, TKey>(
this Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
{
return new KeySelectorToIComparerConverter<TSource, TKey>(
keySelector,
comparer);
}
private class KeySelectorToIComparerConverter<TSource, TKey>
: IComparer<TSource>
{
private readonly IComparer<TKey> comparer;
private readonly Func<TSource, TKey> keySelector;
public KeySelectorToIComparerConverter(
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
{
this.comparer = comparer;
this.keySelector = keySelector;
}
public int Compare(TSource x, TSource y)
{
return comparer.Compare(keySelector(x), keySelector(y));
}
}
This method performs a stable sort; that is, if the keys of two elements are equal, the order of the elements is preserved.
More importantly: when it comes to querying databases, LINQ is in most cases a significantly more productive querying language than SQL. Compared to SQL, LINQ is simpler, tidier, and higher-level. It's rather like comparing C# to C++.
Aggregate
is a good place to start with:
SortedList<TKey, TSource> resultlist = new SortedList<TKey, TSource>();
MyBigList.Aggregate(resultlist, (aktlist,entry) => {
aktlist.Add(entry.Key, entry);
if (aktlist.Count > 10) aktlist.RemoveAt(10);
return aktlist;
});
If you want a different comparer, you can specify one in the constructor of the SortedList
.
EDIT As mentioned by nikie, SortedList
cannot contain double values. You can use a standard list together with BinarySearch
to achieve the same effect:
List<TSource> resultlist = new List<TSource>();
MyBigList.Aggregate(resultlist, (aktlist, entry) => {
int index = aktlist.BinarySearch(entry);
if (index < 0) index = ~index;
if (index < 10) aktlist.Insert(index, entry);
if (aktlist.Count > 10) aktlist.RemoveAt(10);
return aktlist;
});
Again a custom comparer (together with a custom key selection) can be used as parameter to BinarySearch
.
I think what you want is really a selection algorithm. I don't know that LINQ is the best way to implement one since I think it basically ends up as selection by sorting. You ought to be able to do this in O(kN), where k is the "top" number of items by iterating through the collection, keeping track of the minimum "top" element seen so far and if the current element is bigger than that, replacing that element with the current element (and updating the new minimum element). This is space efficient as well.
When you are done you can return the "top" elements as an ordered collection.
Note: I'm assuming LINQ to Objects here. If you are using LINQ to SQL, then I'd defer simply defer the ordering/selection to the SQL server and simply chain the methods appropriately to get a select top N ... from ... order by ...
query.
Completely untested, not even compiled. Uses a generic Fibonacci Heap implementation. I'll post the code on my blog (http://farm-fresh-code.blogspot.com) sometime soon. I've got one hanging around (not sure if it's generic) as a result of some experiments with priority queues that I was doing. See wikipedia for info and pseudocode until then.
public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer,
int topCount)
{
// allocate enough space to hold the number of elements (+1 as a new candidate is added)
FibonacciHeap<TKey,TSource> top = new FibonacciHeap<TKey,TSource>( comparer );
foreach (var candidate in source) // O(n)
{
TKey key = keySelector(candidate);
TKey minimum = top.AccessMinimum();
if (minimum == null || comparer.Compare( key, minimum.Key ) > 0) // O(1)
{
top.Insert( key, candidate ); // O(1)
if (top.Count >= topCount)
{
top.DeleteMinimum(); // O(logk)
}
}
}
return top.ToList().Reverse().Select( t.Value ); // O(k)
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With