 

Batchify long Linq operations?

I asked a question, and it was answered here, about performance issues I had with a large collection of data (created with LINQ).

OK, let's leave that aside.

But one of the interesting (and ingenious) optimizations suggested by Marc was to Batchify the LINQ query:

/*1*/  static IEnumerable<T> Batchify<T>(this IEnumerable<T> source, int count)
/*2*/  {
/*3*/      var list = new List<T>(count);                // reusable buffer holding one batch
/*4*/      foreach (var item in source)
/*5*/      {
/*6*/          list.Add(item);
/*7*/          if (list.Count == count)
/*8*/          {
/*9*/              foreach (var x in list) yield return x;   // flush a full batch
/*10*/             list.Clear();
/*11*/         }
/*12*/     }
/*13*/     foreach (var item in list) yield return item;     // flush the final partial batch
/*14*/ }

Here, the purpose of Batchify is to ensure that we aren't helping the server too much by taking appreciable time between each operation - the data is invented in batches of 1000 and each batch is made available very quickly.
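For context, here is a minimal usage sketch (not from the original question; CreateItems is a hypothetical stand-in for whatever generates the data, and Batchify is assumed to be defined in a static class in scope):

using System;
using System.Collections.Generic;

static class Demo
{
    // Hypothetical producer: does some CPU work per item and yields lazily.
    static IEnumerable<int> CreateItems(int total)
    {
        var rand = new Random();
        for (int i = 0; i < total; i++)
            yield return rand.Next();
    }

    static void Main()
    {
        // Everything is still deferred; Batchify only changes *when* items
        // become available: in bursts of 1000 rather than strictly one at a time.
        foreach (var item in CreateItems(1000000).Batchify(1000))
        {
            // push `item` to the store here
        }
    }
}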

Now, I understand what it is doing, but I can't tell what difference it makes, so I might be missing how it actually works. (Sometimes you think you know something... till...)

OK, back to basics:

AFAIK, LINQ works like this chain:

[image: the LINQ operator chain]

So, we can't start enumerating until the result of Select in:

Where-->OrderBy-->Select

has been produced.

So basically I'm waiting for Select to have all the correct data (after Where, after OrderBy), and only then can my code touch those values (yielded from the Select).

But according to my understanding of Marc's answer, it seems that there is a gap between those yields which allows other resources to do something... (?)

If so, then between each iteration of line #4, after line #9, is there time for the CPU to do something else?

Question

  • Can someone please shed some light on how this actually works?

NB

I already know that (for example) Select is nothing but:

public static IEnumerable<TResult> Select<TSource, TResult>
 (this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
   foreach (TSource element in source)
       yield return selector(element);
}

But if so, my code can't touch the values until all of them (after Where, OrderBy) have been calculated...
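And Where is the same kind of iterator block; here is a sketch from memory (not the exact framework source):

public static IEnumerable<TSource> Where<TSource>
 (this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
   foreach (TSource element in source)
       if (predicate(element))
           yield return element;   // each match is handed on immediately
}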

Edit:

For those asking whether there's a difference: http://i.stack.imgur.com/19Ojw.jpg

2 seconds for 1M items. 9 seconds for 5M items.

(Ignore the second time line; it's from an extra Console.Write line.)

Here it is for the 5M list: http://i.stack.imgur.com/DflGR.jpg (the first one is with Batchify, the other is not).


asked May 13 '14 by Royi Namir




1 Answer

Important: the image shown includes OrderBy; note that this breaks Batchify here, because OrderBy is a buffered operator. The Batchify method I showed is intended for non-buffered spooling streams.
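You can see the buffering with a trivial side-effect trace; this is a sketch added here for illustration, not code from the original question:

using System;
using System.Linq;

static class Trace
{
    static void Main()
    {
        var data = new[] { 3, 1, 2 };

        // Where/Select are streaming: each element flows through the whole
        // chain (where, select, consumer) before the next one is pulled.
        var streamed = data
            .Where(x => { Console.WriteLine("where  " + x); return true; })
            .Select(x => { Console.WriteLine("select " + x); return x; });
        foreach (var x in streamed) Console.WriteLine("got    " + x);

        // OrderBy is buffered: the first MoveNext consumes the ENTIRE source
        // (all the "where" lines print first), and only then do items emerge.
        var buffered = data
            .Where(x => { Console.WriteLine("where  " + x); return true; })
            .OrderBy(x => x);
        foreach (var x in buffered) Console.WriteLine("got    " + x);
    }
}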

In the context in which I used it, the origin (before the batchify) was an iterator block that did lots of things involving object creation and pseudo-random number generators on each iteration. Because the code in question was timing sensitive, what I did not want to do was to introduce a reliable pause (for the CPU work of creating each item) between each call to the store. This was in part to emulate the original code, which created all the objects up front, and was in part because I understand how SE.Redis handles socket work.

Let's consider the behaviour without Batchify:

  • create an item (CPU work) and yield it
  • send it to the store (network IO)
  • create an item (CPU work) and yield it
  • send it to the store (network IO)
  • create an item (CPU work) and yield it
  • send it to the store (network IO)
  • ...

In particular, this means that there is a predictable pause between store requests. SE.Redis handles socket IO on a dedicated worker thread, and the above could quite easily result in high packet fragmentation, especially since I was using the "fire and forget" flag. The writer thread needs to flush periodically, which it does when either the buffer hits a critical size, or there is no more work in the outbound message queue.
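In code, that unbatched shape is roughly the following (a sketch only; CreateItems and SendToStore are hypothetical stand-ins, not the actual code):

// Unbatched: CPU work and network IO alternate strictly, item by item.
foreach (var item in CreateItems(1000000))   // per-item CPU cost paid here
{
    SendToStore(item);   // fire-and-forget network IO queued here
    // ...then a predictable CPU pause before the next request, during which
    // the outbound queue empties and the writer thread flushes tiny packets.
}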

Now consider what batchify does:

  • create an item (CPU work) and buffer it
  • create an item (CPU work) and buffer it
  • ...
  • create an item (CPU work) and buffer it
  • yield an item
  • send it to the store (network IO)
  • yield an item
  • send it to the store (network IO)
  • ...
  • yield an item
  • send it to the store (network IO)
  • create an item (CPU work) and buffer it
  • ...

Here you can hopefully see that the CPU effort between store requests is significantly reduced. This more correctly mimics the original code, where a list of millions was created initially and then iterated. But additionally, it means that there is a very good chance that the thread creating outbound messages can go at least as fast as the writer thread, which means that the outbound queue is unlikely to drop to zero for any appreciable time. This allows for much lower packet fragmentation, because now instead of having a packet per request, there's a good chance that multiple messages are in each packet. Fewer packets generally means higher bandwidth due to reduced overheads.
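The batchified version of the same loop (same hypothetical stand-ins as above) differs by a single call, but changes the timing profile completely:

// Batchified: the CPU cost for 1000 items is paid up front, then those
// 1000 items are yielded back-to-back with almost no work between them.
foreach (var item in CreateItems(1000000).Batchify(1000))
{
    SendToStore(item);   // requests arrive in a tight burst, so the outbound
                         // queue stays non-empty and many messages share a packet
}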

answered Oct 14 '22 by Marc Gravell