I asked a question here and got an answer about performance issues I had with a large collection of data (created with LINQ). OK, let's leave that aside. But one of the interesting (and ingenious) optimizations suggested by Marc was to Batchify the LINQ query:
/*1*/  static IEnumerable<T> Batchify<T>(this IEnumerable<T> source, int count)
/*2*/  {
/*3*/      var list = new List<T>(count);
/*4*/      foreach (var item in source)
/*5*/      {
/*6*/          list.Add(item);
/*7*/          if (list.Count == count)
/*8*/          {
/*9*/              foreach (var x in list) yield return x;
/*10*/             list.Clear();
/*11*/         }
/*12*/     }
/*13*/     foreach (var item in list) yield return item;
/*14*/ }
Here, the purpose of Batchify is to ensure that we aren't helping the server too much by taking appreciable time between each operation - the data is invented in batches of 1000 and each batch is made available very quickly.
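Just to make sure I read the method right, here is a quick throwaway check of my own (assuming the Batchify above is compiled as an extension method in scope):

using System;
using System.Linq;

static class BatchifySanityCheck
{
    static void Main()
    {
        // Throwaway check: Batchify doesn't change the items or their order,
        // it only changes the rhythm in which they are produced (bursts of 3 here).
        foreach (var x in Enumerable.Range(0, 10).Batchify(3))
            Console.Write(x + " ");   // prints: 0 1 2 3 4 5 6 7 8 9
    }
}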
Now, I understand what it is doing, but I can't tell what difference it makes, since I might be missing how it actually works. (Sometimes you think you know something... till...)
OK, back to basics.

AFAIK, LINQ works like this chain:

Where --> OrderBy --> Select

So we can't start enumerating until the Select has finished producing its result. Basically, I'm waiting for Select to have all the correct data (after Where, after OrderBy), and only then can my code touch those values (yielded from the Select).
But according to my understanding of Marc's answer, it seems that there is a gap between those yields which allows other resources to do something... (?)

If so, then between each iteration of line #4, after line #9, is there time for the CPU to do something else?
Question

Where does that gap come from, and what difference does Batchify actually make here?

NB: I already know that (for example) Select is nothing but:
public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    foreach (TSource element in source)
        yield return selector(element);
}
But if so, my code can't touch those values until all of them (after Where, after OrderBy) have been calculated...
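For what it's worth, here is a minimal console sketch I could use to check that assumption (the Trace helper and the sample values are made up just for this test):

using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Logs each element as it leaves a stage, so we can see when the consumer runs.
    static IEnumerable<int> Trace(this IEnumerable<int> source, string stage)
    {
        foreach (var x in source)
        {
            Console.WriteLine($"{stage}: {x}");
            yield return x;
        }
    }

    static void Main()
    {
        var data = new[] { 3, 1, 2 };

        // Without OrderBy: Where/Select stream, so the consumer touches each value
        // before the next one is even produced.
        foreach (var x in data.Trace("source").Where(n => n > 0).Select(n => n * 10))
            Console.WriteLine($"consumer: {x}");

        // With OrderBy: the whole source is buffered before the first yield,
        // so the consumer only runs after everything has been read and sorted.
        foreach (var x in data.Trace("source").OrderBy(n => n).Select(n => n * 10))
            Console.WriteLine($"consumer: {x}");
    }
}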
For those who ask if there's a difference: http://i.stack.imgur.com/19Ojw.jpg

2 seconds for 1M items, 9 seconds for 5M items. (Ignore the second timing line; that's an extra Console.Write call.)

Here it is for the 5M list: http://i.stack.imgur.com/DflGR.jpg (the first run is with Batchify, the other is without).
Marc's answer

Important: the image shown includes OrderBy: note that this breaks Batchify here, because OrderBy is a buffered operator. The Batchify method I showed is intended for non-buffered spooling streams.
In the context in which I used it, the origin (before the batchify) was an iterator block that did lots of things involving object creation and pseudo-random number generators on each iteration. Because the code in question was timing sensitive, what I did not want to do was to introduce a reliable pause (for the CPU work of creating each item) between each call to the store. This was in part to emulate the original code, which created all the objects up front, and was in part because I understand how SE.Redis handles socket work.
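Roughly speaking, such an origin is an iterator block shaped something like the following (simplified; the names and item type here are placeholders, not the actual code):

using System;
using System.Collections.Generic;

static class SourceSketch
{
    // Each step of the iterator does real CPU work (object creation,
    // pseudo-random numbers) before the next item becomes available,
    // i.e. the work happens between one yield and the next.
    static IEnumerable<KeyValuePair<string, string>> CreateItems(int count)
    {
        var rand = new Random(12345);          // fixed seed, just for the sketch
        for (int i = 0; i < count; i++)
        {
            var key = "key:" + i;              // CPU work per item happens here
            var value = rand.Next().ToString();
            yield return new KeyValuePair<string, string>(key, value);
        }
    }
}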
Let's consider the behaviour without Batchify:

create an item (CPU work) and yield it
send it to the store (network work)
create an item (CPU work) and yield it
send it to the store (network work)
...

In particular, this means that there is a predictable pause between store requests. SE.Redis handles socket IO on a dedicated writer thread, and the above could quite easily result in high packet fragmentation, especially since I was using the "fire and forget" flag. The writer thread needs to flush periodically, which it does when either the buffer hits a critical size, or there is no more work in the outbound message queue.
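In code terms, the per-item pattern is roughly this shape (a simplified sketch, not the actual benchmark code; IDatabase, StringSet and CommandFlags.FireAndForget are from StackExchange.Redis, and the item type is just a placeholder):

using System.Collections.Generic;
using StackExchange.Redis;

static class PerItemSketch
{
    // Per-item pattern: the CPU work needed to produce each item sits between
    // consecutive fire-and-forget writes, so there is a predictable pause
    // between store requests and the writer thread keeps running out of work.
    static void StoreOneByOne(IDatabase db, IEnumerable<KeyValuePair<string, string>> source)
    {
        foreach (var item in source)                  // producing 'item' costs CPU time
        {
            db.StringSet(item.Key, item.Value,
                         flags: CommandFlags.FireAndForget);  // queued for the writer thread
        }
    }
}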
Now consider what Batchify does:

create n items (CPU work) and buffer them
yield an item, send it to the store (network work)
yield an item, send it to the store (network work)
...
create the next n items (CPU work) and buffer them
...

Here you can hopefully see that the CPU effort between store requests is significantly reduced. This more correctly mimics the original code, where a list of millions was created up front and then iterated. But additionally, it means that there is a very good chance that the thread creating outbound messages can go at least as fast as the writer thread, which means that the outbound queue is unlikely to become empty for any appreciable time. This allows for much lower packet fragmentation, because now, instead of having a packet per request, there's a good chance that multiple messages end up in each packet. Fewer packets generally means higher bandwidth due to reduced overheads.
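And with Batchify in the chain, the same loop becomes (again just a sketch, with the same placeholder types):

using System.Collections.Generic;
using StackExchange.Redis;

static class BatchifiedSketch
{
    // Batchified pattern: CPU work happens in bursts of 1000 items, after which
    // those items are yielded back-to-back, so the writes land in the outbound
    // queue with almost no gap and multiple messages can share a packet.
    static void StoreBatchified(IDatabase db, IEnumerable<KeyValuePair<string, string>> source)
    {
        foreach (var item in source.Batchify(1000))   // Batchify from the question above
        {
            db.StringSet(item.Key, item.Value, flags: CommandFlags.FireAndForget);
        }
    }
}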