Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to avoid OrderBy - memory usage problems

Let's assume we have a large list of points List<Point> pointList (already stored in memory) where each Point contains X, Y, and Z coordinate.

Now, I would like to select for example N% of points with biggest Z-values of all points stored in pointList. Right now I'm doing it like that:

N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
    .OrderBy(p=> p.Z) // First bottleneck - creates sorted copy of all data
    .ElementAt((int) pointList.Count * (1 - N)).Z;

List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();

But I have here two memory usage bottlenecks: first during OrderBy (more important) and second during selecting the points (this is less important, because we usually want to select only small amount of points).

Is there any way of replacing OrderBy (or maybe other way of finding this cutoff point) with something that uses less memory?

The problem is quite important, because LINQ copies the whole dataset and for big files I'm processing it sometimes hits few hundreds of MBs.

like image 343
Gacek Avatar asked Jul 25 '10 16:07

Gacek


1 Answers

Write a method that iterates through the list once and maintains a set of the M largest elements. Each step will only require O(log M) work to maintain the set, and you can have O(M) memory and O(N log M) running time.

public static IEnumerable<TSource> TakeLargest<TSource, TKey>
    (this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
    var set = new SortedDictionary<TKey, List<TSource>>();
    var resultCount = 0;
    var first = default(KeyValuePair<TKey, List<TSource>>);
    foreach (var item in items)
    {
        // If the key is already smaller than the smallest
        // item in the set, we can ignore this item
        var key = selector(item);
        if (first.Value == null ||
            resultCount < count ||
            Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
        {
            // Add next item to set
            if (!set.ContainsKey(key))
            {
                set[key] = new List<TSource>();
            }
            set[key].Add(item);
            if (first.Value == null)
            {
                first = set.First();
            }

            // Remove smallest item from set
            resultCount++;
            if (resultCount - first.Value.Count >= count)
            {
                set.Remove(first.Key);
                resultCount -= first.Value.Count;
                first = set.First();
            }
        }
    }
    return set.Values.SelectMany(values => values);
}

That will include more than count elements if there are ties, as your implementation does now.

like image 200
Quartermeister Avatar answered Sep 20 '22 16:09

Quartermeister