Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"Lazy" GroupBy with Linq

Tags:

c#

linq

I have recently been in a situation where I needed to perform an operation a grouped slowly yielding Linq query.

Now, groupBy looses it's lazyness, that means that you have to wait for the entire Sequence to finish until you get any groups returned. This to me logically seems not the best solution, as a group can be returned as soon as it is first encountered.

I have written the following code, which seems to work fine, and am looking for pitfalls and general improvements, as well as thoughts on the concept itself (eg. can/should a groupBy method return groups as soon as possible).

public static IEnumerable<KeyValuePair<R, IEnumerable<T>>> GroupByLazy<T, R>(this IEnumerable<T> source, Func<T, R> keySelector)
        {
            var dic = new Dictionary<R, BlockingCollection<T>>();
            foreach (var item in source)
            {
                var Key = keySelector(item);
                BlockingCollection<T> i;
                if (!dic.TryGetValue(Key, out i))
                {
                    i = new BlockingCollection<T>();
                    i.Add(item);
                    dic.Add(Key, i);
                    yield return new KeyValuePair<R, IEnumerable<T>>(Key, i);
                }
                else i.TryAdd(item);
            }
            // mark all the groups as completed so that enumerations of group-items can finish
            foreach (var groupedValues in dic.Values)
                groupedValues.CompleteAdding();
        }

Simple Test:

var slowIE = Observable.Interval(TimeSpan.FromSeconds(1)).ToEnumerable().Take(10);
            var debug = slowIE.Do(i => Console.WriteLine("\teval " + i));

            var gl = debug.GroupByLazy(i => i % 2 == 0);

            var g = debug.GroupBy(i => i % 2 == 0);

            Console.WriteLine("Lazy:");
            gl.Run(i => Console.WriteLine("Group returned: " + i.Key));
            Console.WriteLine(gl.Single(i => i.Key).Value.Count());

            Console.WriteLine("NonLazy:");
            g.Run(i => Console.WriteLine("Group returned: " + i.Key));
            Console.WriteLine(g.Single(i => i.Key).Count());

            Console.ReadLine();

which prints:

Lazy:
        eval 0
Group returned: True
        eval 1
Group returned: False
        eval 2
        eval 3
        eval 4
        eval 5
        eval 6
        eval 7
        eval 8
        eval 9
NonLazy:
        eval 0
        eval 1
        eval 2
        eval 3
        eval 4
        eval 5
        eval 6
        eval 7
        eval 8
        eval 9
Group returned: True
Group returned: False

As you can see, in my LazyGroupBy the groups are returned as soon as they are first encountered, and can thus be acted upon without waiting for the entire sequence to be grouped.

Thoughts?

Edit: quick thought, I think "Lazy" is not the right term...I'm not a native speaker, what term am I actually looking for?

like image 718
chrisaut Avatar asked Jan 22 '23 15:01

chrisaut


2 Answers

In your solution it appears that the returned groups will change after the group is returned. This may suit some programming patterns, but I don't see it as generally useful.

Imagine that you process a group when it is first returned, and then at some later time a new item is added to the group. How do you know to reprocess the group's members? I imagine some grouped items might never be processed by the caller. Even though CompleteAdding is called, no notification is provided to the consumer of LazyGroupBy.

Again, this may suit some situations, but I can't think of when I would use it offhand.

like image 109
Drew Noakes Avatar answered Feb 01 '23 18:02

Drew Noakes


This "lazy" execution is called deferred execution.

When you return a group, it only contains the first item, and no items will be added to it until you get more groups. So this approach only works if you handle the groups in a separate thread so that the main thread can continue reading the collection, or if you first read all groups and then process them, which of course makes the deferred processing pointless.

Also, you always have to read all groups for the groups to be complete, if you use Take to limit the query, the method won't complete, and the already returned groups might never be completed. This also means that you may have threads still waiting for data that will never be there.

like image 39
Guffa Avatar answered Feb 01 '23 19:02

Guffa