I wrote a LINQ extension method SplitBetween
analogous to String.Split
.
> new List<int>(){3,4,2,21,3,2,17,16,1}
> .SplitBetween(x=>x>=10)
[3,4,2], [3,2], [], [1]
Source:
// partition sequence into sequence of contiguous subsequences
// behaves like String.Split
public static IEnumerable<IEnumerable<T>> SplitBetween<T>(this IEnumerable<T> source,
Func<T, bool> separatorSelector,
bool includeSeparator = false)
{
var l = new List<T>();
foreach (var x in source)
{
if (separatorSelector(x))
{
if (includeSeparator)
{
l.Add(x);
}
yield return l;
l = new List<T>();
}
else
{
l.Add(x);
}
}
yield return l;
}
In the spirit of LINQ I think this method ought to do lazy evaluation. However, my implementation does lazy evaluation over the outer IEnumerable, but not over the inner IEnumerable. How can I fix this?
A demonstration of how the outer behaviour is lazy. Assume ThrowingEnumerable<int>
is an IEnumerable<int>
that explodes when anyone tries to iterate over it (see Skeet's Edulinq).
(new List<int>(){1,2,3,10,1})
.Concat(Extensions.ThrowingEnumerable<int>())
.SplitBetween(x=>x>=10)
.First().ToList();
[1,2,3]
but the inner behaviour isn't lazy
(new List<int>(){1,2,3,10,1})
.Concat(Extensions.ThrowingEnumerable<int>())
.SplitBetween(x=>x>=10)
.ElementAt(2).First();
BOOM
We expect 1 here.
public static IEnumerable<IEnumerable<T>> SplitBetween<T>(this IEnumerable<T> source, Func<T, bool> separatorSelector, bool includeSeparators = false)
{
var state = new SharedState<T>(source, separatorSelector, includeSeparators);
state.LastList = state.NewList = new InnerList<T>(state, 0);
for (; ; )
{
if (state.NewList != null)
{
var newList = state.NewList;
state.NewList = null;
yield return newList.Items();
}
else if (state.IsEnd)
break;
else
state.CheckNext();
}
}
class SharedState<T>
{
public SharedState(IEnumerable<T> source, Func<T, bool> separatorSelector, bool includeSeparators)
{
this.source = source;
this.separatorSelector = separatorSelector;
this.includeSeparators = includeSeparators;
this.iterator = source.GetEnumerator();
this.data = source as IList<T>;
if (data == null)
{
cache = new List<T>();
data = cache;
}
}
public readonly IEnumerable<T> source;
readonly IEnumerator<T> iterator;
public readonly IList<T> data;
readonly List<T> cache;
public readonly Func<T, bool> separatorSelector;
public readonly bool includeSeparators;
public int WaveIndex = -1;
public bool IsEnd = false;
public InnerList<T> NewList;
public InnerList<T> LastList;
public void CheckNext()
{
WaveIndex++;
if (!iterator.MoveNext())
{
if (LastList.LastIndex == null)
LastList.LastIndex = WaveIndex;
IsEnd = true;
}
else
{
var item = iterator.Current;
if (cache != null)
cache.Add(item);
if (separatorSelector(item))
{
LastList.LastIndex = includeSeparators ? WaveIndex + 1 : WaveIndex;
LastList = NewList = new InnerList<T>(this, WaveIndex + 1);
}
}
}
}
class InnerList<T>
{
public InnerList(SharedState<T> state, int startIndex)
{
this.state = state;
this.StartIndex = startIndex;
}
readonly SharedState<T> state;
public readonly int StartIndex;
public int? LastIndex;
public IEnumerable<T> Items()
{
for (var i = StartIndex; ; ++i)
{
if (LastIndex != null && i >= LastIndex)
break;
if (i >= state.WaveIndex)
state.CheckNext();
if (LastIndex == null || i < LastIndex)
yield return state.data[i];
}
}
}
Edit: There is nothing wrong with your approach, except that a throwing enumerable will really "boom" when you enumerate it. Thats what's its meant for. It doesn't have a proper GetEnumerator
defined on it. So your code exhibits no real problem. In the first case by doing First
, you're only enumerating till the first result set (just { 1, 2, 3 }
) is obtained and not enumerating the throwing enumerable (which means Concat
is not being executed). But in the second example, you're asking for element at 2
after the split, which means it will enumerate the throwing enumerable too and will go "boom". The key here is to understand ElementAt
enumerates the collection till the index asked to and is not inherently lazy (it cant be).
I'm not sure if fully lazy is the way to go here. The problem is that the whole process of splitting lazily into outer and inner sequences runs on one enumerator which can yield different results depending on enumerator state. For instance you enumerate only the outer sequence, the inner sequences no longer will be what you expect. Or if you enumerate only half the outer sequence and one inner sequence, what will be the state of other inner sequences? Your approach is the best.
The below approach is lazy (still will boom since that's warranted) in that it uses no intermediate concrete implementations, but can be slower than your original approach because it traverses the list more than once:
public static IEnumerable<IEnumerable<T>> SplitBy<T>(this IEnumerable<T> source,
Func<T, bool> separatorPredicate,
bool includeEmptyEntries = false,
bool includeSeparators = false)
{
int prevIndex = 0;
int lastIndex = 0;
var query = source.Select((t, index) => { lastIndex = index; return new { t, index }; })
.Where(a => separatorPredicate(a.t));
foreach (var item in query)
{
if (item.index == prevIndex && !includeEmptyEntries)
{
prevIndex++;
continue;
}
yield return source.Skip(prevIndex)
.Take(item.index - prevIndex + (!includeSeparators ? 0 : 1));
prevIndex = item.index + 1;
}
if (prevIndex <= lastIndex)
yield return source.Skip(prevIndex);
}
Over all your original approach is the best. If you need something fully lazy, then my below answer fits. Mind you its only meant for things like:
foreach (var inners in outer)
foreach (var item in inners)
{
}
and not
var outer = sequence.Split;
var inner1 = outer.First;
var inner2 = outer.ElementAt; //etc
In other words, not fit for multiple iterations on the same inner sequence. If you are fully aware of this dangerous construct:
This uses no intermediate concrete collections, no ToList
on source enumerable, and is fully lazy/iterator-ish:
public static IEnumerable<IEnumerable<T>> SplitBy<T>(this IEnumerable<T> source,
Func<T, bool> separatorPredicate,
bool includeEmptyEntries = false,
bool includeSeparator = false)
{
using (var x = source.GetEnumerator())
while (x.MoveNext())
if (!separatorPredicate(x.Current))
yield return x.YieldTill(separatorPredicate, includeSeparator);
else if (includeEmptyEntries)
{
if (includeSeparator)
yield return Enumerable.Repeat(x.Current, 1);
else
yield return Enumerable.Empty<T>();
}
}
static IEnumerable<T> YieldTill<T>(this IEnumerator<T> x,
Func<T, bool> separatorPredicate,
bool includeSeparator)
{
yield return x.Current;
while (x.MoveNext())
if (!separatorPredicate(x.Current))
yield return x.Current;
else
{
if (includeSeparator)
yield return x.Current;
yield break;
}
}
Short, sweet and simple. I have added an additional flag to denote if you want to return empty sets (by default it ignores). Without that flag, the code is even more concise.
Thanks for this question, this will be there in my extension methods library! :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With