Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

[Optimize This]: Slow LINQ to Objects Query

I have this query that is bothering me; it is encapsulated as a new query operator, I made two versions of it, trying to see which one performs better. Both perform horribly.

First attempt; declarative style

public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
    return source.Any()
        ? source.Take(length).Cons(source.Skip(length).Section(length))
        : Enumerable.Empty<IEnumerable<α>>();
}

Second attempt: imperative "yield return" style

public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
    var fst = source.Take(length);
    var rst = source.Skip(length);

    yield return fst;

    if (rst.Any())
        foreach (var section in rst.Section(length))
            yield return section;
}

In fact the second attempt is worse, both in terms of readability, compositionality and in terms of speed.

Any clues as to how to optimize this?

like image 532
Bent Rasmussen Avatar asked Feb 08 '10 14:02

Bent Rasmussen


2 Answers

If I understand your question correctly, you're trying to build a lazy implementation of an enumerator that splits a larger collection of items into smaller enumerable collections of items.

For instance, a sequence of a million numbers could be split up into "sections", each producing only 100 of them, and you want it all done lazily, ie. no collecting 100 items into a list before producing them.

First, your attempts will re-iterate over the collection many times, which is bad, hence the performance problem.

If you're trying to build a pure lazy implementation, you should consider the following issues:

  • You want to iterate over the underlying collection only once
  • You should return enumerables that re-use the underlying enumerator
  • You need to handle that a section you return isn't completely enumerated over (for instance, the calling code only needed the first 50 of those 100 items).

Edit: Before I go into my simplistic solution, here's a few caveats about it:

  • You cannot save each section for later, ie. you cannot do: collection.Sequence(10).ToArray() to get an array of sections.
  • You cannot enumerate over each section more than once, because when you do, it changes a hidden underlying data structure.

Basically: My solution is not general purpose. You should go with @LBushkin's comment about MoreLinq Batch if you need this, and I would hesitate to put my code into a class library, it would have to be local where it is needed, or renamed to something that clearly warns you about the issues with it.


Here's a simplistic implementation, I'm pretty sure there are bugs here, so you might want to look at implementing a ton of unit-testing for edgecases:

using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication20
{
    class SectionEnumerable<T> : IEnumerable<T>
    {
        private readonly IEnumerator<T> _Enumerator;

        public SectionEnumerable(IEnumerator<T> enumerator, int sectionSize)
        {
            _Enumerator = enumerator;
            Left = sectionSize;
        }

        public IEnumerator<T> GetEnumerator()
        {
            while (Left > 0)
            {
                Left--;
                yield return _Enumerator.Current;
                if (Left > 0)
                    if (!_Enumerator.MoveNext())
                        break;
            }
        }

        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }

        public int Left { get; private set; }
    }

    static class SequenceExtensions
    {
        public static IEnumerable<IEnumerable<T>> Section<T>(this IEnumerable<T> collection, int sectionSize)
        {
            if (collection == null)
                throw new ArgumentNullException("collection");
            if (sectionSize < 1)
                throw new ArgumentOutOfRangeException("sectionSize");

            using (IEnumerator<T> enumerator = collection.GetEnumerator())
            {
                while (enumerator.MoveNext())
                {
                    SectionEnumerable<T> enumerable = new SectionEnumerable<T>(enumerator, sectionSize);
                    yield return enumerable;
                    for (int index = 0; index < enumerable.Left; index++)
                        if (!enumerator.MoveNext())
                            yield break;
                }
            }
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var sequence = Enumerable.Range(0, 100);
            var sections = sequence.Section(10);
            foreach (var section in sections)
            {
                Console.WriteLine(
                    String.Join(", ",
                    section.Take(5).ToArray().Select(i => i.ToString()).ToArray()));
            }
            Console.ReadLine();
        }
    }
}

Output:

0, 1, 2, 3, 4
10, 11, 12, 13, 14
20, 21, 22, 23, 24
30, 31, 32, 33, 34
40, 41, 42, 43, 44
50, 51, 52, 53, 54
60, 61, 62, 63, 64
70, 71, 72, 73, 74
80, 81, 82, 83, 84
90, 91, 92, 93, 94

Things you should unit-test:

  • Empty input collection produces no sections
  • A collection that has just the right number of elements, produces only one section
  • A collection that contains a multiplum of the section-size elements, (ie. 10, 20, 30, etc. number of elements with a section size of 5 or 10), doesn't produce an empty section after all the expected ones
  • That it is actually lazy, if you enumerate over the first 10-element section, but only the first 5 of the second section, only the first 15 elements of the underlying collection is enumerated over
like image 185
Lasse V. Karlsen Avatar answered Sep 20 '22 00:09

Lasse V. Karlsen


I suspect that the problem you're having is related to the fact that enumerating the final result is at least an O(n^2) operation, possibly worse; I haven't worked it all out in my head yet.

Why is that? Well, suppose you have [1, 2, 3, 4, 5, 6] and you split that up into what you think is { { 1, 2 }, {3, 4}, {5, 6} }

That's not what you've done. You've in fact split this up into { take the first two, take the first two and discard them and then take the next two, take the first two and discard then and then take the next two and discard them and then take the third two }

Notice how each step along the way re-calculate the result? That's because the array could be changing between calls to the enumeration. LINQ was designed to always get you up-to-date results; you write a query that means "skip the first four and iterate the next two", that's exactly what you get -- a query that executes that code when you enumerate it.

Is the original sequence small enough and fast enough that you can read the whole thing into memory and split it all up at once, rather than trying to do so lazily? Alternatively, is the sequence indexible? If all you get is forward access to the sequence and it is too big or slow to read into memory all at once then there is not a whole lot you can do here. But if you have one or both of those properties then you can make this at least linear.

like image 42
Eric Lippert Avatar answered Sep 18 '22 00:09

Eric Lippert