In C# Linq, GroupBy
returns an IEnumerable
of IGrouping
items, which in turn are an IEnumerable
of the items of the chosen value type. Here's an example:
var namesAndScores = new Dictionary<string, int>> {
["David"] = 90,
["Jane"] = 91,
["Bill"] = 90,
["Tina"] = 89)
};
var IEnumerable<IGrouping<int, string>> namesGroupedByScore =
namesAndScores
.GroupBy(
kvp => kvp.Value,
kvp => kvp.Key
);
// Result:
// 90 : { David, Bill }
// 91 : { Jane }
// 89 : { Tina }
Specifically, note that each IGrouping<int, string>
is IEnumerable<string>
and is not, say, List<string>
. (It also has a .Key
property.)
The GroupBy
obviously has to fully enumerate the input items before it can emit a single grouping, however, since it does emit an IEnumerable<string>
instead of a List<string>
, there may be a performance benefit if you don't enumerate the entire grouping, such as if you just did .First()
.
Aside: technically, I suppose GroupBy
could wait until you enumerate it to consume a single item from the input, then emit a single IGrouping
, and only enumerate the rest of the input as the IGrouping
is enumerated, collecting other groups into its internal data structure as it searches for the next item in the current group, but I find that an unlikely and problematic implementation, and expect that GroupBy
will enumerate fully at invocation time.
Here's what code with First()
would look like:
var oneStudentForEachNumericScore = namesGroupedByScore
.ToDictionary(
grouping => grouping.Key,
grouping => grouping.First() // does not fully enumerate the values
);
// Result:
// 90 : David -- Bill is missing and we don't care
// 91 : Jane
// 89 : Tina
Now in Java Streams, to group, you have to collect, and you can't just give the groupingBy
collector a second lambda for extracting the value. If you want a different value than the entire input, you have to map again (though note that the groupingBy
collector lets you create multi-level groups of groups of ... of groups in one step). Here is equivalent code to the above C# code:
Map<Integer, List<String>> namesGroupedByScore = namesAndScores
.entrySet().stream()
.collect(Collectors.groupingBy(
Map.Entry::getValue,
Collectors.mapping(
Map.Entry::getKey,
Collectors.toList(),
)
));
This seems less than optimal. So my questions are:
Collectors.mapping
to get the group items to be the value?IEnumerable
value type of C#'s GroupBy
and return a Map<Integer, Stream<String>>
from Collectors.mapping()
, or would that be of no use because the value items have to be enumerated fully, anyway? Or could we write our own Collectors.groupingBy
that takes a lambda for the second argument and does the job for us, making the syntax closer to Linq's GroupBy
and having at least cleaner syntax and possibly slightly improved performance?toStream()
which returns a Stream
and does NOT iterate its input until and unless it is enumerated (iterating one element at a time, deferred)?The groupingBy() method of Collectors class in Java are used for grouping objects by some property and storing results in a Map instance. In order to use it, we always need to specify a property by which the grouping would be performed. This method provides similar functionality to SQL's GROUP BY clause.
The toMap collector can be used to collect Stream elements into a Map instance. To do this, we need to provide two functions: keyMapper.
The collect() method of Stream class can be used to accumulate elements of any Stream into a Collection.
A Collector is specified by four functions that work together to accumulate entries into a mutable result container, and optionally perform a final transform on the result. They are: creation of a new result container ( supplier() ) incorporating a new data element into a result container ( accumulator() )
While these operation look similar in some aspects, they are fundamentally different. Unlike Linq’s GroupBy
operation, Java’s groupingBy
is a Collector
, designed to work with the terminal operation collect
of the Stream API, which is a not an intermediate operation per se and hence, can’t be used to implement a lazy stream operation in general.
The groupingBy
collector uses another downstream Collector
for the groups, so instead of streaming over the group’s elements, to perform another operation, you would specify a collector performing that operation in-place, in the best case. While these collectors do not support short-circuiting, they eliminate the need to collect the groups into List
s, just to stream over them. Just consider, e.g. groupingBy(f1, summingInt(f2))
. The case of collecting the groups into a List
has been considered common enough to make toList()
implied when you don’t specify a collector, but this hasn’t been considered for the case of mapping the elements before collecting into a list.
If you encounter this case often enough, it would be easy to define your own collector
public static <T,K,V> Collector<T,?,Map<K,List<V>>> groupingBy(
Function<? super T, ? extends K> key, Function<? super T, ? extends V> value) {
return Collectors.groupingBy(key, Collectors.mapping(value, Collectors.toList()));
}
and use it like
Map<Integer,List<String>> result = map.entrySet().stream()
.collect(groupingBy(Map.Entry::getValue, Map.Entry::getKey));
and, since you are not required to use method references and want to be closer to the Linq original:
Map<Integer,List<String>> result = map.entrySet().stream()
.collect(groupingBy(kvp -> kvp.getValue(), kvp -> kvp.getKey()));
but, as noted, if you are going to stream over this map afterwards and worrying about the non-laziness of this operation, you probably want to use a different collector than toList()
anyway.
While this approach offers some flexibility regarding the resulting values, the Map
and its keys are an unavoidable part of this operation, as not only does the Map
provide the storage logic, its lookup operation is also responsible for forming the groups, which also determines the semantic. E.g. when you use the variant with a map supplier with () -> new TreeMap<>(customComparator)
you may get entirely different groups as with the default HashMap
(think of, e.g. String.CASE_INSENSITIVE_ORDER
). On the other hand, when you supply an EnumMap
, you may not get different semantics but entirely different performance characteristics.
In contrast, the GroupBy
operation from Linq you described, looks like an intermediate operation which has no pendant in the Stream API at all. As you suggested yourself, chances are high that it still does a full traversal when the first element has been polled, completely populating a data structure behind the scenes. Even if an implementation tries some laziness, the results are limited. You could cheaply get the first element of the first group but if you’re only interested in that element, you wouldn’t need grouping at all. The second element of the first group might already be the last one of the source stream, requiring a full traversal and storage.
So offering such an operation would imply some complexity with little benefit over collecting eagerly. It’s also hard to imagine parallel capable implementation of it (offering benefits over the collect
operation). The actual inconvenience stems not from this design decision, but from the fact that the resulting Map
is not a Collection
(note that implementing Iterable
alone wouldn’t imply having a stream()
method) and the decision to separate collection operations and stream operations. These two aspects lead to the requirement to use entrySet().stream()
to stream over the map, but that’s outside the scope of this question. And, as said above, if you need this, check first whether a different downstream collector for the groupingBy
collector couldn’t provide the desired result in the first place.
For completeness, here is a solution that tries to implement a lazy grouping:
public interface Group<K,V> {
K key();
Stream<V> values();
}
public static <T,K,V> Stream<Group<K,V>> group(Stream<T> s,
Function<? super T, ? extends K> key, Function<? super T, ? extends V> value) {
return StreamSupport.stream(new Spliterator<Group<K,V>>() {
final Spliterator<T> sp = s.spliterator();
final Map<K,GroupImpl<T,K,V>> map = new HashMap<>();
ArrayDeque<Group<K,V>> pendingGroup = new ArrayDeque<>();
Consumer<T> c;
{
c = t -> map.compute(key.apply(t), (k,g) -> {
V v = value.apply(t);
if(g == null) pendingGroup.addLast(g = new GroupImpl<>(k, v, sp, c));
else g.add(v);
return g;
});
}
public boolean tryAdvance(Consumer<? super Group<K,V>> action) {
do {} while(sp.tryAdvance(c) && pendingGroup.isEmpty());
Group<K,V> g = pendingGroup.pollFirst();
if(g == null) return false;
action.accept(g);
return true;
}
public Spliterator<Group<K,V>> trySplit() {
return null; // that surely doesn't work in parallel
}
public long estimateSize() {
return sp.estimateSize();
}
public int characteristics() {
return ORDERED|NONNULL;
}
}, false);
}
static class GroupImpl<T,K,V> implements Group<K,V> {
private final K key;
private final V first;
private final Spliterator<T> source;
private final Consumer<T> sourceConsumer;
private List<V> values;
GroupImpl(K k, V firstValue, Spliterator<T> s, Consumer<T> c) {
key = k;
first = firstValue;
source = s;
sourceConsumer = c;
}
public K key() {
return key;
}
public Stream<V> values() {
return StreamSupport.stream(
new Spliterators.AbstractSpliterator<V>(1, Spliterator.ORDERED) {
int pos;
public boolean tryAdvance(Consumer<? super V> action) {
if(pos == 0) {
pos++;
action.accept(first);
return true;
}
do {} while((values==null || values.size()<pos)
&&source.tryAdvance(sourceConsumer));
if(values==null || values.size()<pos) return false;
action.accept(values.get(pos++ -1));
return true;
}
}, false);
}
void add(V value) {
if(values == null) values = new ArrayList<>();
values.add(value);
}
}
You can test it with the following example:
group(
Stream.of("foo", "bar", "baz", "hello", "world", "a", "b", "c")
.peek(s -> System.out.println("source traversal: "+s)),
String::length,
String::toUpperCase)
.filter(h -> h.values().anyMatch(s -> s.startsWith("B")))
.findFirst()
.ifPresent(g -> System.out.println("group with key "+g.key()));
which will print:
source traversal: foo
source traversal: bar
group with key 3
showing that the laziness works as far as possible. But
The example above could be rewritten to
Stream.of("foo", "bar", "baz", "hello", "world", "a", "b", "c")
.peek(s -> System.out.println("source traversal: "+s))
.filter(s -> s.toUpperCase().startsWith("H"))
.map(String::length)
.findFirst()
.ifPresent(key -> System.out.println("group with key "+key));
which offers even better laziness (e.g. if the match is not within the first group).
Of course, the example was contrived, but I have the strong feeling that almost any operation which bears the potential of lazy processing, i.e. does not need all groups and does not need all elements of at least one group, can be rewritten into an operation that doesn’t need grouping at all.
Here are the solutions for part of your questions by StreamEx and my library AbacusUtil
Map<String, Integer> namesAndScores
= N.asMap("David", 90, "Jane", 91, "Bill", 90, "Tina", 89);
// By StreamEx
Map<Integer, List<String>> namesGroupedByScore = EntryStream.of(namesAndScores)
.invert().grouping();
// By AbacusUtil
Map<Integer, List<String>> namesGroupedByScore = EntryStream.of(namesAndScores)
.groupTo(Fn.value(), Fn.key());
// Or
Map<Integer, Stream<String>> namesGroupedByScore2 =
EntryStream.of(namesAndScores).toMap(Fn.value(), collectingAndThen(mapping(Fn.key()), Stream::of));
If you only want to save the first name after group by:
Map<Integer, List<String>> namesAndScores3 =
EntryStream.of(namesAndScores).distinctByValue().groupTo(Fn.value(), Fn.key());
// Or
Map<Integer, String> namesAndScores4 =
EntryStream.of(namesAndScores).distinctByValue().toMap(Fn.value(), Fn.key());
If you want to save at most 2 values.
Map<Integer, List<String>> namesAndScores5 = EntryStream.of(namesAndScores).toMap(Fn.value(),
MoreCollectors.mapping(Fn.key(), MoreCollectors.toList(2)));
For the rest of questions, I believe what Holger has said:"...but I have the strong feeling that almost any operation which bears the potential of lazy processing, i.e. does not need all groups and does not need all elements of at least one group, can be rewritten into an operation that doesn’t need grouping at all."
In any scenario, if groupBy
is required, I don't think there exists such implementation without iterating all the elements, regardingless which language you're using. If iterating all the elements is unnecessary, most likely groupBy
is unnecessary or misused.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With