Can someone suggest a way to create batches of a certain size in LINQ?
Ideally I want to be able to perform operations in chunks of some configurable amount.
You don't need to write any code. Use MoreLINQ Batch method, which batches the source sequence into sized buckets (MoreLINQ is available as a NuGet package you can install):
int size = 10;
var batches = sequence.Batch(size);
Which is implemented as:
public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
this IEnumerable<TSource> source, int size)
{
TSource[] bucket = null;
var count = 0;
foreach (var item in source)
{
if (bucket == null)
bucket = new TSource[size];
bucket[count++] = item;
if (count != size)
continue;
yield return bucket;
bucket = null;
count = 0;
}
if (bucket != null && count > 0)
yield return bucket.Take(count).ToArray();
}
public static class MyExtensions
{
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items,
int maxItems)
{
return items.Select((item, inx) => new { item, inx })
.GroupBy(x => x.inx / maxItems)
.Select(g => g.Select(x => x.item));
}
}
and the usage would be:
List<int> list = new List<int>() { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
foreach(var batch in list.Batch(3))
{
Console.WriteLine(String.Join(",",batch));
}
OUTPUT:
0,1,2
3,4,5
6,7,8
9
An Enumerable.Chunk()
extension method was added to .NET 6.0.
Example:
var list = new List<int> { 1, 2, 3, 4, 5, 6, 7 };
var chunks = list.Chunk(3);
// returns { { 1, 2, 3 }, { 4, 5, 6 }, { 7 } }
For those who cannot upgrade, the source is available on GitHub.
If you start with sequence
defined as an IEnumerable<T>
, and you know that it can safely be enumerated multiple times (e.g. because it is an array or a list), you can just use this simple pattern to process the elements in batches:
while (sequence.Any())
{
var batch = sequence.Take(10);
sequence = sequence.Skip(10);
// do whatever you need to do with each batch here
}
This is a fully lazy, low overhead, one-function implementation of Batch that doesn't do any accumulation. Based on (and fixes issues in) Nick Whaley's solution with help from EricRoller.
Iteration comes directly from the underlying IEnumerable, so elements must be enumerated in strict order, and accessed no more than once. If some elements aren't consumed in an inner loop, they are discarded (and trying to access them again via a saved iterator will throw InvalidOperationException: Enumeration already finished.
).
You can test a complete sample at .NET Fiddle.
public static class BatchLinq
{
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
{
if (size <= 0)
throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");
using (var enumerator = source.GetEnumerator())
while (enumerator.MoveNext())
{
int i = 0;
// Batch is a local function closing over `i` and `enumerator` that
// executes the inner batch enumeration
IEnumerable<T> Batch()
{
do yield return enumerator.Current;
while (++i < size && enumerator.MoveNext());
}
yield return Batch();
while (++i < size && enumerator.MoveNext()); // discard skipped items
}
}
}
All of the above perform terribly with large batches or low memory space. Had to write my own that will pipeline (notice no item accumulation anywhere):
public static class BatchLinq {
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size) {
if (size <= 0)
throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");
using (IEnumerator<T> enumerator = source.GetEnumerator())
while (enumerator.MoveNext())
yield return TakeIEnumerator(enumerator, size);
}
private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size) {
int i = 0;
do
yield return source.Current;
while (++i < size && source.MoveNext());
}
}
Edit: Known issue with this approach is that each batch must be enumerated and enumerated fully before moving to the next batch. For example this doesn't work:
//Select first item of every 100 items
Batch(list, 100).Select(b => b.First())
I wonder why nobody has ever posted an old school for-loop solution. Here is one:
List<int> source = Enumerable.Range(1,23).ToList();
int batchsize = 10;
for (int i = 0; i < source.Count; i+= batchsize)
{
var batch = source.Skip(i).Take(batchsize);
}
This simplicity is possible because the Take method:
... enumerates
source
and yields elements untilcount
elements have been yielded orsource
contains no more elements. Ifcount
exceeds the number of elements insource
, all elements ofsource
are returned
Disclaimer:
Using Skip and Take inside the loop means that the enumerable will be enumerated multiple times. This is dangerous if the enumerable is deferred. It may result in multiple executions of a database query, or a web request, or a file read. This example is explicitly for the usage of a List which is not deferred, so it is less of a problem. It is still a slow solution since skip will enumerate the collection each time it is called.
This can also be solved using the GetRange
method, but it requires an extra calculation to extract a possible rest batch:
for (int i = 0; i < source.Count; i += batchsize)
{
int remaining = source.Count - i;
var batch = remaining > batchsize ? source.GetRange(i, batchsize) : source.GetRange(i, remaining);
}
Here is a third way to handle this, which works with 2 loops. This ensures that the collection is enumerated only 1 time!:
int batchsize = 10;
List<int> batch = new List<int>(batchsize);
for (int i = 0; i < source.Count; i += batchsize)
{
// calculated the remaining items to avoid an OutOfRangeException
batchsize = source.Count - i > batchsize ? batchsize : source.Count - i;
for (int j = i; j < i + batchsize; j++)
{
batch.Add(source[j]);
}
batch.Clear();
}
Here is an attempted improvement of Nick Whaley's (link) and infogulch's (link) lazy Batch
implementations. This one is strict. You either enumerate the batches in the correct order, or you get an exception.
public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
this IEnumerable<TSource> source, int size)
{
if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
using (var enumerator = source.GetEnumerator())
{
int i = 0;
while (enumerator.MoveNext())
{
if (i % size != 0) throw new InvalidOperationException(
"The enumeration is out of order.");
i++;
yield return GetBatch();
}
IEnumerable<TSource> GetBatch()
{
while (true)
{
yield return enumerator.Current;
if (i % size == 0 || !enumerator.MoveNext()) break;
i++;
}
}
}
}
And here is a lazy Batch
implementation for sources of type IList<T>
. This one imposes no restrictions on the enumeration. The batches can be enumerated partially, in any order, and more than once. The restriction of not modifying the collection during the enumeration is still in place though. This is achieved by making a dummy call to enumerator.MoveNext()
before yielding any chunk or element. The downside is that the enumerator is left undisposed, since it is unknown when the enumeration is going to finish.
public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
this IList<TSource> source, int size)
{
if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
var enumerator = source.GetEnumerator();
for (int i = 0; i < source.Count; i += size)
{
enumerator.MoveNext();
yield return GetChunk(i, Math.Min(i + size, source.Count));
}
IEnumerable<TSource> GetChunk(int from, int toExclusive)
{
for (int j = from; j < toExclusive; j++)
{
enumerator.MoveNext();
yield return source[j];
}
}
}
Same approach as MoreLINQ, but using List instead of Array. I haven't done benchmarking, but readability matters more to some people:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
{
List<T> batch = new List<T>();
foreach (var item in source)
{
batch.Add(item);
if (batch.Count >= size)
{
yield return batch;
batch.Clear();
}
}
if (batch.Count > 0)
{
yield return batch;
}
}
Here's the cleanest version of Batch
that I can come up with:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int count)
{
if (source == null) throw new System.ArgumentNullException("source");
if (count <= 0) throw new System.ArgumentOutOfRangeException("count");
using (var enumerator = source.GetEnumerator())
{
IEnumerable<T> BatchInner()
{
int counter = 0;
do
yield return enumerator.Current;
while (++counter < count && enumerator.MoveNext());
}
while (enumerator.MoveNext())
yield return BatchInner().ToArray();
}
}
Using this code:
Console.WriteLine(String.Join(Environment.NewLine,
Enumerable.Range(0, 20).Batch(8).Select(xs => String.Join(",", xs))));
I get:
0,1,2,3,4,5,6,7
8,9,10,11,12,13,14,15
16,17,18,19
It's important to note that on the answers from "" & "" that this code fails:
var e = Enumerable.Range(0, 20).Batch(8).ToArray();
Console.WriteLine(String.Join(Environment.NewLine, e.Select(xs => String.Join(",", xs))));
Console.WriteLine();
Console.WriteLine(String.Join(Environment.NewLine, e.Select(xs => String.Join(",", xs))));
On their answers it gives:
19
19
19
19
19
19
Due to the inner enumerable not being computed as an array.
So with a functional hat on, this appears trivial....but in C#, there are some significant downsides.
you'd probably view this as an unfold of IEnumerable (google it and you'll probably end up in some Haskell docs, but there may be some F# stuff using unfold, if you know F#, squint at the Haskell docs and it will make sense).
Unfold is related to fold ("aggregate") except rather than iterating through the input IEnumerable, it iterates through the output data structures (its a similar relationship between IEnumerable and IObservable, in fact I think IObservable does implement an "unfold" called generate...)
anyway first you need an unfold method, I think this works (unfortunately it will eventually blow the stack for large "lists"...you can write this safely in F# using yield! rather than concat);
static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
{
var maybeNewSeedAndElement = f(seed);
return maybeNewSeedAndElement.SelectMany(x => new[] { x.Item2 }.Concat(Unfold(f, x.Item1)));
}
this is a bit obtuse because C# doesn't implement some of the things functional langauges take for granted...but it basically takes a seed and then generates a "Maybe" answer of the next element in the IEnumerable and the next seed (Maybe doesn't exist in C#, so we've used IEnumerable to fake it), and concatenates the rest of the answer (I can't vouch for the "O(n?)" complexity of this).
Once you've done that then;
static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
{
return Unfold(ys =>
{
var head = ys.Take(n);
var tail = ys.Skip(n);
return head.Take(1).Select(_ => Tuple.Create(tail, head));
},
xs);
}
it all looks quite clean...you take the "n" elements as the "next" element in the IEnumerable, and the "tail" is the rest of the unprocessed list.
if there is nothing in the head...you're over...you return "Nothing" (but faked as an empty IEnumerable>)...else you return the head element and the tail to process.
you probably can do this using IObservable, there's probably a "Batch" like method already there, and you can probably use that.
If the risk of stack overflows worries (it probably should), then you should implement in F# (and there's probably some F# library (FSharpX?) already with this).
(I have only done some rudimentary tests of this, so there may be the odd bugs in there).
I'm joining this very late but i found something more interesting.
So we can use here Skip
and Take
for better performance.
public static class MyExtensions
{
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items, int maxItems)
{
return items.Select((item, index) => new { item, index })
.GroupBy(x => x.index / maxItems)
.Select(g => g.Select(x => x.item));
}
public static IEnumerable<T> Batch2<T>(this IEnumerable<T> items, int skip, int take)
{
return items.Skip(skip).Take(take);
}
}
Next I checked with 100000 records. The looping only is taking more time in case of Batch
Code Of console application.
static void Main(string[] args)
{
List<string> Ids = GetData("First");
List<string> Ids2 = GetData("tsriF");
Stopwatch FirstWatch = new Stopwatch();
FirstWatch.Start();
foreach (var batch in Ids2.Batch(5000))
{
// Console.WriteLine("Batch Ouput:= " + string.Join(",", batch));
}
FirstWatch.Stop();
Console.WriteLine("Done Processing time taken:= "+ FirstWatch.Elapsed.ToString());
Stopwatch Second = new Stopwatch();
Second.Start();
int Length = Ids2.Count;
int StartIndex = 0;
int BatchSize = 5000;
while (Length > 0)
{
var SecBatch = Ids2.Batch2(StartIndex, BatchSize);
// Console.WriteLine("Second Batch Ouput:= " + string.Join(",", SecBatch));
Length = Length - BatchSize;
StartIndex += BatchSize;
}
Second.Stop();
Console.WriteLine("Done Processing time taken Second:= " + Second.Elapsed.ToString());
Console.ReadKey();
}
static List<string> GetData(string name)
{
List<string> Data = new List<string>();
for (int i = 0; i < 100000; i++)
{
Data.Add(string.Format("{0} {1}", name, i.ToString()));
}
return Data;
}
Time taken Is like this.
First - 00:00:00.0708 , 00:00:00.0660
Second (Take and Skip One) - 00:00:00.0008, 00:00:00.0008
I wrote a custom IEnumerable implementation that works without linq and guarantees a single enumeration over the data. It also accomplishes all this without requiring backing lists or arrays that cause memory explosions over large data sets.
Here are some basic tests:
[Fact]
public void ShouldPartition()
{
var ints = new List<int> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
var data = ints.PartitionByMaxGroupSize(3);
data.Count().Should().Be(4);
data.Skip(0).First().Count().Should().Be(3);
data.Skip(0).First().ToList()[0].Should().Be(0);
data.Skip(0).First().ToList()[1].Should().Be(1);
data.Skip(0).First().ToList()[2].Should().Be(2);
data.Skip(1).First().Count().Should().Be(3);
data.Skip(1).First().ToList()[0].Should().Be(3);
data.Skip(1).First().ToList()[1].Should().Be(4);
data.Skip(1).First().ToList()[2].Should().Be(5);
data.Skip(2).First().Count().Should().Be(3);
data.Skip(2).First().ToList()[0].Should().Be(6);
data.Skip(2).First().ToList()[1].Should().Be(7);
data.Skip(2).First().ToList()[2].Should().Be(8);
data.Skip(3).First().Count().Should().Be(1);
data.Skip(3).First().ToList()[0].Should().Be(9);
}
The Extension Method to partition the data.
/// <summary>
/// A set of extension methods for <see cref="IEnumerable{T}"/>.
/// </summary>
public static class EnumerableExtender
{
/// <summary>
/// Splits an enumerable into chucks, by a maximum group size.
/// </summary>
/// <param name="source">The source to split</param>
/// <param name="maxSize">The maximum number of items per group.</param>
/// <typeparam name="T">The type of item to split</typeparam>
/// <returns>A list of lists of the original items.</returns>
public static IEnumerable<IEnumerable<T>> PartitionByMaxGroupSize<T>(this IEnumerable<T> source, int maxSize)
{
return new SplittingEnumerable<T>(source, maxSize);
}
}
This is the implementing class
using System.Collections;
using System.Collections.Generic;
internal class SplittingEnumerable<T> : IEnumerable<IEnumerable<T>>
{
private readonly IEnumerable<T> backing;
private readonly int maxSize;
private bool hasCurrent;
private T lastItem;
public SplittingEnumerable(IEnumerable<T> backing, int maxSize)
{
this.backing = backing;
this.maxSize = maxSize;
}
public IEnumerator<IEnumerable<T>> GetEnumerator()
{
return new Enumerator(this, this.backing.GetEnumerator());
}
IEnumerator IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
private class Enumerator : IEnumerator<IEnumerable<T>>
{
private readonly SplittingEnumerable<T> parent;
private readonly IEnumerator<T> backingEnumerator;
private NextEnumerable current;
public Enumerator(SplittingEnumerable<T> parent, IEnumerator<T> backingEnumerator)
{
this.parent = parent;
this.backingEnumerator = backingEnumerator;
this.parent.hasCurrent = this.backingEnumerator.MoveNext();
if (this.parent.hasCurrent)
{
this.parent.lastItem = this.backingEnumerator.Current;
}
}
public bool MoveNext()
{
if (this.current == null)
{
this.current = new NextEnumerable(this.parent, this.backingEnumerator);
return true;
}
else
{
if (!this.current.IsComplete)
{
using (var enumerator = this.current.GetEnumerator())
{
while (enumerator.MoveNext())
{
}
}
}
}
if (!this.parent.hasCurrent)
{
return false;
}
this.current = new NextEnumerable(this.parent, this.backingEnumerator);
return true;
}
public void Reset()
{
throw new System.NotImplementedException();
}
public IEnumerable<T> Current
{
get { return this.current; }
}
object IEnumerator.Current
{
get { return this.Current; }
}
public void Dispose()
{
}
}
private class NextEnumerable : IEnumerable<T>
{
private readonly SplittingEnumerable<T> splitter;
private readonly IEnumerator<T> backingEnumerator;
private int currentSize;
public NextEnumerable(SplittingEnumerable<T> splitter, IEnumerator<T> backingEnumerator)
{
this.splitter = splitter;
this.backingEnumerator = backingEnumerator;
}
public bool IsComplete { get; private set; }
public IEnumerator<T> GetEnumerator()
{
return new NextEnumerator(this.splitter, this, this.backingEnumerator);
}
IEnumerator IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
private class NextEnumerator : IEnumerator<T>
{
private readonly SplittingEnumerable<T> splitter;
private readonly NextEnumerable parent;
private readonly IEnumerator<T> enumerator;
private T currentItem;
public NextEnumerator(SplittingEnumerable<T> splitter, NextEnumerable parent, IEnumerator<T> enumerator)
{
this.splitter = splitter;
this.parent = parent;
this.enumerator = enumerator;
}
public bool MoveNext()
{
this.parent.currentSize += 1;
this.currentItem = this.splitter.lastItem;
var hasCcurent = this.splitter.hasCurrent;
this.parent.IsComplete = this.parent.currentSize > this.splitter.maxSize;
if (this.parent.IsComplete)
{
return false;
}
if (hasCcurent)
{
var result = this.enumerator.MoveNext();
this.splitter.lastItem = this.enumerator.Current;
this.splitter.hasCurrent = result;
}
return hasCcurent;
}
public void Reset()
{
throw new System.NotImplementedException();
}
public T Current
{
get { return this.currentItem; }
}
object IEnumerator.Current
{
get { return this.Current; }
}
public void Dispose()
{
}
}
}
}
Another way is using Rx Buffer operator
//using System.Linq;
//using System.Reactive.Linq;
//using System.Reactive.Threading.Tasks;
var observableBatches = anAnumerable.ToObservable().Buffer(size);
var batches = aList.ToObservable().Buffer(size).ToList().ToTask().GetAwaiter().GetResult();
An easy version to use and understand.
public static List<List<T>> chunkList<T>(List<T> listToChunk, int batchSize)
{
List<List<T>> batches = new List<List<T>>();
if (listToChunk.Count == 0) return batches;
bool moreRecords = true;
int fromRecord = 0;
int countRange = 0;
if (listToChunk.Count >= batchSize)
{
countRange = batchSize;
}
else
{
countRange = listToChunk.Count;
}
while (moreRecords)
{
List<T> batch = listToChunk.GetRange(fromRecord, countRange);
batches.Add(batch);
if ((fromRecord + batchSize) >= listToChunk.Count)
{
moreRecords = false;
}
fromRecord = fromRecord + batch.Count;
if ((fromRecord + batchSize) > listToChunk.Count)
{
countRange = listToChunk.Count - fromRecord;
}
else
{
countRange = batchSize;
}
}
return batches;
}
As a new helper method for LINQ in .NET 6 you can chunk any IEnumerable into batches:
int chunkNumber = 1;
foreach (int[] chunk in Enumerable.Range(0, 9).Chunk(3))
{
Console.WriteLine($"Chunk {chunkNumber++}");
foreach (var item in chunk)
{
Console.WriteLine(item);
}
}
Here is an implementation that uses Async iteration in C# via IAsyncEnumerable
- https://docs.microsoft.com/en-us/dotnet/csharp/whats-new/tutorials/generate-consume-asynchronous-stream
public static class EnumerableExtensions
{
/// <summary>
/// Chunks a sequence into a sub-sequences each containing maxItemsPerChunk, except for the last
/// which will contain any items left over.
///
/// NOTE: this implements a streaming implementation via <seealso cref="IAsyncEnumerable{T}"/>.
/// </summary>
public static async IAsyncEnumerable<IEnumerable<T>> ChunkAsync<T>(this IAsyncEnumerable<T> sequence, int maxItemsPerChunk)
{
if (sequence == null) throw new ArgumentNullException(nameof(sequence));
if (maxItemsPerChunk <= 0)
{
throw new ArgumentOutOfRangeException(nameof(maxItemsPerChunk), $"{nameof(maxItemsPerChunk)} must be greater than 0");
}
var chunk = new List<T>(maxItemsPerChunk);
await foreach (var item in sequence)
{
chunk.Add(item);
if (chunk.Count == maxItemsPerChunk)
{
yield return chunk.ToArray();
chunk.Clear();
}
}
// return the "crumbs" that
// didn't make it into a full chunk
if (chunk.Count > 0)
{
yield return chunk.ToArray();
}
}
/// <summary>
/// Chunks a sequence into a sub-sequences each containing maxItemsPerChunk, except for the last
/// which will contain any items left over.
/// </summary>
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> sequence, int maxItemsPerChunk)
{
if (sequence == null) throw new ArgumentNullException(nameof(sequence));
if (maxItemsPerChunk <= 0)
{
throw new ArgumentOutOfRangeException(nameof(maxItemsPerChunk), $"{nameof(maxItemsPerChunk)} must be greater than 0");
}
var chunk = new List<T>(maxItemsPerChunk);
foreach (var item in sequence)
{
chunk.Add(item);
if (chunk.Count == maxItemsPerChunk)
{
yield return chunk.ToArray();
chunk.Clear();
}
}
// return the "crumbs" that
// didn't make it into a full chunk
if (chunk.Count > 0)
{
yield return chunk.ToArray();
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With