Can a Java Collectors.groupingBy return a Stream as its list of grouped items?

Tags:

In C# Linq, GroupBy returns an IEnumerable of IGrouping items, which in turn are an IEnumerable of the items of the chosen value type. Here's an example:

var namesAndScores = new Dictionary<string, int>> {
    ["David"] = 90,
    ["Jane"] = 91,
    ["Bill"] = 90,
    ["Tina"] = 89)
};
var IEnumerable<IGrouping<int, string>> namesGroupedByScore =
    namesAndScores
        .GroupBy(
            kvp => kvp.Value,
            kvp => kvp.Key
        );

// Result:
// 90 : { David, Bill }
// 91 : { Jane }
// 89 : { Tina }

Specifically, note that each IGrouping<int, string> is IEnumerable<string> and is not, say, List<string>. (It also has a .Key property.)

The GroupBy obviously has to fully enumerate the input items before it can emit a single grouping, however, since it does emit an IEnumerable<string> instead of a List<string>, there may be a performance benefit if you don't enumerate the entire grouping, such as if you just did .First().

Aside: technically, I suppose GroupBy could wait until you enumerate it to consume a single item from the input, then emit a single IGrouping, and only enumerate the rest of the input as the IGrouping is enumerated, collecting other groups into its internal data structure as it searches for the next item in the current group, but I find that an unlikely and problematic implementation, and expect that GroupBy will enumerate fully at invocation time.

Here's what code with First() would look like:

 var oneStudentForEachNumericScore = namesGroupedByScore
     .ToDictionary(
         grouping => grouping.Key,
         grouping => grouping.First() // does not fully enumerate the values
     );
 // Result:
 // 90 : David -- Bill is missing and we don't care
 // 91 : Jane
 // 89 : Tina

Now in Java Streams, to group, you have to collect, and you can't just give the groupingBy collector a second lambda for extracting the value. If you want a different value than the entire input, you have to map again (though note that the groupingBy collector lets you create multi-level groups of groups of ... of groups in one step). Here is equivalent code to the above C# code:

Map<Integer, List<String>> namesGroupedByScore = namesAndScores
      .entrySet().stream()
      .collect(Collectors.groupingBy(
          Map.Entry::getValue,
          Collectors.mapping(
              Map.Entry::getKey,
              Collectors.toList(),
          )
      ));

This seems less than optimal. So my questions are:

Is there some way to express this more simply, without having to use Collectors.mapping to get the group items to be the value?
Why do we have to collect to a fully enumerated type? Is there a way to simulate the IEnumerable value type of C#'s GroupBy and return a Map<Integer, Stream<String>> from Collectors.mapping(), or would that be of no use because the value items have to be enumerated fully, anyway? Or could we write our own Collectors.groupingBy that takes a lambda for the second argument and does the job for us, making the syntax closer to Linq's GroupBy and having at least cleaner syntax and possibly slightly improved performance?
As a theoretical exercise, even if not practically useful, is it possible to write our own Java Stream Collector toStream() which returns a Stream and does NOT iterate its input until and unless it is enumerated (iterating one element at a time, deferred)?

218

asked May 28 '18 21:05

ErikE

2 Answers

While these operation look similar in some aspects, they are fundamentally different. Unlike Linq’s GroupBy operation, Java’s groupingBy is a Collector, designed to work with the terminal operation collect of the Stream API, which is a not an intermediate operation per se and hence, can’t be used to implement a lazy stream operation in general.

The groupingBy collector uses another downstream Collector for the groups, so instead of streaming over the group’s elements, to perform another operation, you would specify a collector performing that operation in-place, in the best case. While these collectors do not support short-circuiting, they eliminate the need to collect the groups into Lists, just to stream over them. Just consider, e.g. groupingBy(f1, summingInt(f2)). The case of collecting the groups into a List has been considered common enough to make toList() implied when you don’t specify a collector, but this hasn’t been considered for the case of mapping the elements before collecting into a list.

If you encounter this case often enough, it would be easy to define your own collector

public static <T,K,V> Collector<T,?,Map<K,List<V>>> groupingBy(
    Function<? super T, ? extends K> key, Function<? super T, ? extends V> value) {
    return Collectors.groupingBy(key, Collectors.mapping(value, Collectors.toList()));
}

and use it like

Map<Integer,List<String>> result = map.entrySet().stream()
    .collect(groupingBy(Map.Entry::getValue, Map.Entry::getKey));

and, since you are not required to use method references and want to be closer to the Linq original:

Map<Integer,List<String>> result = map.entrySet().stream()
        .collect(groupingBy(kvp -> kvp.getValue(), kvp -> kvp.getKey()));

but, as noted, if you are going to stream over this map afterwards and worrying about the non-laziness of this operation, you probably want to use a different collector than toList() anyway.

While this approach offers some flexibility regarding the resulting values, the Map and its keys are an unavoidable part of this operation, as not only does the Map provide the storage logic, its lookup operation is also responsible for forming the groups, which also determines the semantic. E.g. when you use the variant with a map supplier with () -> new TreeMap<>(customComparator) you may get entirely different groups as with the default HashMap (think of, e.g. String.CASE_INSENSITIVE_ORDER). On the other hand, when you supply an EnumMap, you may not get different semantics but entirely different performance characteristics.

In contrast, the GroupBy operation from Linq you described, looks like an intermediate operation which has no pendant in the Stream API at all. As you suggested yourself, chances are high that it still does a full traversal when the first element has been polled, completely populating a data structure behind the scenes. Even if an implementation tries some laziness, the results are limited. You could cheaply get the first element of the first group but if you’re only interested in that element, you wouldn’t need grouping at all. The second element of the first group might already be the last one of the source stream, requiring a full traversal and storage.

So offering such an operation would imply some complexity with little benefit over collecting eagerly. It’s also hard to imagine parallel capable implementation of it (offering benefits over the collect operation). The actual inconvenience stems not from this design decision, but from the fact that the resulting Map is not a Collection (note that implementing Iterable alone wouldn’t imply having a stream() method) and the decision to separate collection operations and stream operations. These two aspects lead to the requirement to use entrySet().stream() to stream over the map, but that’s outside the scope of this question. And, as said above, if you need this, check first whether a different downstream collector for the groupingBy collector couldn’t provide the desired result in the first place.

For completeness, here is a solution that tries to implement a lazy grouping:

public interface Group<K,V> {
    K key();
    Stream<V> values();
}
public static <T,K,V> Stream<Group<K,V>> group(Stream<T> s,
    Function<? super T, ? extends K> key, Function<? super T, ? extends V> value) {

    return StreamSupport.stream(new Spliterator<Group<K,V>>() {
        final Spliterator<T> sp = s.spliterator();
        final Map<K,GroupImpl<T,K,V>> map = new HashMap<>();
        ArrayDeque<Group<K,V>> pendingGroup = new ArrayDeque<>();
        Consumer<T> c;
        {
        c = t -> map.compute(key.apply(t), (k,g) -> {
            V v = value.apply(t);
            if(g == null) pendingGroup.addLast(g = new GroupImpl<>(k, v, sp, c));
            else g.add(v);
            return g;
        });
        }
        public boolean tryAdvance(Consumer<? super Group<K,V>> action) {
            do {} while(sp.tryAdvance(c) && pendingGroup.isEmpty());
            Group<K,V> g = pendingGroup.pollFirst();
            if(g == null) return false;
            action.accept(g);
            return true;
        }
        public Spliterator<Group<K,V>> trySplit() {
            return null; // that surely doesn't work in parallel
        }
        public long estimateSize() {
            return sp.estimateSize();
        }
        public int characteristics() {
            return ORDERED|NONNULL;
        }
    }, false);
}
static class GroupImpl<T,K,V> implements Group<K,V> {
    private final K key;
    private final V first;
    private final Spliterator<T> source;
    private final Consumer<T> sourceConsumer;
    private List<V> values;

    GroupImpl(K k, V firstValue, Spliterator<T> s, Consumer<T> c) {
        key = k;
        first = firstValue;
        source = s;
        sourceConsumer = c;
    }
    public K key() {
        return key;
    }
    public Stream<V> values() {
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<V>(1, Spliterator.ORDERED) {
            int pos;
            public boolean tryAdvance(Consumer<? super V> action) {
                if(pos == 0) {
                    pos++;
                    action.accept(first);
                    return true;
                }
                do {} while((values==null || values.size()<pos)
                           &&source.tryAdvance(sourceConsumer));
                if(values==null || values.size()<pos) return false;
                action.accept(values.get(pos++ -1));
                return true;
            }
        }, false);
    }
    void add(V value) {
        if(values == null) values = new ArrayList<>();
        values.add(value);
    }
}

You can test it with the following example:

group(
    Stream.of("foo", "bar", "baz", "hello", "world", "a", "b", "c")
          .peek(s -> System.out.println("source traversal: "+s)),
        String::length,
        String::toUpperCase)
    .filter(h -> h.values().anyMatch(s -> s.startsWith("B")))
    .findFirst()
    .ifPresent(g -> System.out.println("group with key "+g.key()));

which will print:

source traversal: foo
source traversal: bar
group with key 3

showing that the laziness works as far as possible. But

Each operation that requires to know all groups/keys needs a full traversal of the source, as the very last element could introduce a new group
Each operation that requires to process all elements of at least one group needs a full traversal, as the very last element of the source could belong to that group
The previous point even applies to short-circuiting operations, if they can’t stop early. E.g., in the example above finding a match in the second group implies a unsuccessful full traversal of the first group, hence a full traversal of the source
The example above could be rewritten to
```
Stream.of("foo", "bar", "baz", "hello", "world", "a", "b", "c")
      .peek(s -> System.out.println("source traversal: "+s))
      .filter(s -> s.toUpperCase().startsWith("H"))
      .map(String::length)
      .findFirst()
      .ifPresent(key -> System.out.println("group with key "+key));
```
which offers even better laziness (e.g. if the match is not within the first group).

Of course, the example was contrived, but I have the strong feeling that almost any operation which bears the potential of lazy processing, i.e. does not need all groups and does not need all elements of at least one group, can be rewritten into an operation that doesn’t need grouping at all.

answered Sep 29 '22 14:09

Holger

Here are the solutions for part of your questions by StreamEx and my library AbacusUtil

Map<String, Integer> namesAndScores 
             = N.asMap("David", 90, "Jane", 91, "Bill", 90, "Tina", 89);

// By StreamEx
Map<Integer, List<String>> namesGroupedByScore = EntryStream.of(namesAndScores)
                                .invert().grouping();

// By AbacusUtil
Map<Integer, List<String>> namesGroupedByScore = EntryStream.of(namesAndScores)
                                   .groupTo(Fn.value(), Fn.key());
// Or
Map<Integer, Stream<String>> namesGroupedByScore2 = 
        EntryStream.of(namesAndScores).toMap(Fn.value(), collectingAndThen(mapping(Fn.key()), Stream::of));

If you only want to save the first name after group by:

Map<Integer, List<String>> namesAndScores3 = 
      EntryStream.of(namesAndScores).distinctByValue().groupTo(Fn.value(), Fn.key());
// Or
Map<Integer, String> namesAndScores4 = 
          EntryStream.of(namesAndScores).distinctByValue().toMap(Fn.value(), Fn.key());

If you want to save at most 2 values.

Map<Integer, List<String>> namesAndScores5 = EntryStream.of(namesAndScores).toMap(Fn.value(),
        MoreCollectors.mapping(Fn.key(), MoreCollectors.toList(2)));

For the rest of questions, I believe what Holger has said:"...but I have the strong feeling that almost any operation which bears the potential of lazy processing, i.e. does not need all groups and does not need all elements of at least one group, can be rewritten into an operation that doesn’t need grouping at all."

In any scenario, if groupBy is required, I don't think there exists such implementation without iterating all the elements, regardingless which language you're using. If iterating all the elements is unnecessary, most likely groupBy is unnecessary or misused.

answered Sep 29 '22 15:09

123-xyz

Related questions
                            
                                How to intercept each method call within given method using Spring AOP or AspectJ
                            
                                Does StampedLock.writeLock() release a read-lock held by the current thread?
                            
                                MongoDB Java Driver: Convert BsonDocument to Document and back
                            
                                Effective Java: Safety of Forwarding Classes
                            
                                Run eureka service in a docker container
                            
                                JavaFX Exception in Application constructor(No Such method found exception)
                            
                                Java-8: stream or simpler solution?
                            
                                How to change description language in OpenWeatherMap api in android
                            
                                Java 8 bytecode how are default methods in interfaces marked
                            
                                How to ignore AccessDeniedExceptions when using Files.walkFileTree?
                            
                                Java Simple Factory with constructors using different parameters
                            
                                The Difference Between finish(); and onBackPressed();
                            
                                Can I have more control on zone offset formatting at DateTimeFormatter [duplicate]
                            
                                Uniquely identify source JDBC process in Oracle DB
                            
                                Liquibase preconditions: How do I check for a column being the correct data type?
                            
                                Type checking with generic Suppliers and lambdas
                            
                                How to deploy spring boot jar file to EC2 using jenkins?
                            
                                What to use instead of `defineClass` in Java10?
                            
                                How to assign List<? extends BaseClass> to List<BaseClass>
                            
                                Java multi-threaded program using join() gives wrong results in calculating sum of adjacent numbers

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Can a Java Collectors.groupingBy return a Stream as its list of grouped items?

Tags:

java

java-stream

collectors

grouping

ErikE

People also ask

2 Answers

Holger

123-xyz

Recent Activity

Donate For Us