I have this query that is bothering me; it is encapsulated as a new query operator, I made two versions of it, trying to see which one performs better. Both perform horribly.
First attempt; declarative style
public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
return source.Any()
? source.Take(length).Cons(source.Skip(length).Section(length))
: Enumerable.Empty<IEnumerable<α>>();
}
Second attempt: imperative "yield return" style
public static IEnumerable<IEnumerable<α>> Section<α>(this IEnumerable<α> source, int length)
{
var fst = source.Take(length);
var rst = source.Skip(length);
yield return fst;
if (rst.Any())
foreach (var section in rst.Section(length))
yield return section;
}
In fact the second attempt is worse, both in terms of readability, compositionality and in terms of speed.
Any clues as to how to optimize this?
If I understand your question correctly, you're trying to build a lazy implementation of an enumerator that splits a larger collection of items into smaller enumerable collections of items.
For instance, a sequence of a million numbers could be split up into "sections", each producing only 100 of them, and you want it all done lazily, ie. no collecting 100 items into a list before producing them.
First, your attempts will re-iterate over the collection many times, which is bad, hence the performance problem.
If you're trying to build a pure lazy implementation, you should consider the following issues:
Edit: Before I go into my simplistic solution, here's a few caveats about it:
collection.Sequence(10).ToArray()
to get an array of sections.Basically: My solution is not general purpose. You should go with @LBushkin's comment about MoreLinq Batch if you need this, and I would hesitate to put my code into a class library, it would have to be local where it is needed, or renamed to something that clearly warns you about the issues with it.
Here's a simplistic implementation, I'm pretty sure there are bugs here, so you might want to look at implementing a ton of unit-testing for edgecases:
using System;
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApplication20
{
class SectionEnumerable<T> : IEnumerable<T>
{
private readonly IEnumerator<T> _Enumerator;
public SectionEnumerable(IEnumerator<T> enumerator, int sectionSize)
{
_Enumerator = enumerator;
Left = sectionSize;
}
public IEnumerator<T> GetEnumerator()
{
while (Left > 0)
{
Left--;
yield return _Enumerator.Current;
if (Left > 0)
if (!_Enumerator.MoveNext())
break;
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
public int Left { get; private set; }
}
static class SequenceExtensions
{
public static IEnumerable<IEnumerable<T>> Section<T>(this IEnumerable<T> collection, int sectionSize)
{
if (collection == null)
throw new ArgumentNullException("collection");
if (sectionSize < 1)
throw new ArgumentOutOfRangeException("sectionSize");
using (IEnumerator<T> enumerator = collection.GetEnumerator())
{
while (enumerator.MoveNext())
{
SectionEnumerable<T> enumerable = new SectionEnumerable<T>(enumerator, sectionSize);
yield return enumerable;
for (int index = 0; index < enumerable.Left; index++)
if (!enumerator.MoveNext())
yield break;
}
}
}
}
class Program
{
static void Main(string[] args)
{
var sequence = Enumerable.Range(0, 100);
var sections = sequence.Section(10);
foreach (var section in sections)
{
Console.WriteLine(
String.Join(", ",
section.Take(5).ToArray().Select(i => i.ToString()).ToArray()));
}
Console.ReadLine();
}
}
}
Output:
0, 1, 2, 3, 4
10, 11, 12, 13, 14
20, 21, 22, 23, 24
30, 31, 32, 33, 34
40, 41, 42, 43, 44
50, 51, 52, 53, 54
60, 61, 62, 63, 64
70, 71, 72, 73, 74
80, 81, 82, 83, 84
90, 91, 92, 93, 94
Things you should unit-test:
I suspect that the problem you're having is related to the fact that enumerating the final result is at least an O(n^2) operation, possibly worse; I haven't worked it all out in my head yet.
Why is that? Well, suppose you have [1, 2, 3, 4, 5, 6] and you split that up into what you think is { { 1, 2 }, {3, 4}, {5, 6} }
That's not what you've done. You've in fact split this up into { take the first two, take the first two and discard them and then take the next two, take the first two and discard then and then take the next two and discard them and then take the third two }
Notice how each step along the way re-calculate the result? That's because the array could be changing between calls to the enumeration. LINQ was designed to always get you up-to-date results; you write a query that means "skip the first four and iterate the next two", that's exactly what you get -- a query that executes that code when you enumerate it.
Is the original sequence small enough and fast enough that you can read the whole thing into memory and split it all up at once, rather than trying to do so lazily? Alternatively, is the sequence indexible? If all you get is forward access to the sequence and it is too big or slow to read into memory all at once then there is not a whole lot you can do here. But if you have one or both of those properties then you can make this at least linear.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With