Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can a Java Collectors.groupingBy return a Stream as its list of grouped items?

In C# Linq, GroupBy returns an IEnumerable of IGrouping items, which in turn are an IEnumerable of the items of the chosen value type. Here's an example:

var namesAndScores = new Dictionary<string, int>> {
    ["David"] = 90,
    ["Jane"] = 91,
    ["Bill"] = 90,
    ["Tina"] = 89)
};
var IEnumerable<IGrouping<int, string>> namesGroupedByScore =
    namesAndScores
        .GroupBy(
            kvp => kvp.Value,
            kvp => kvp.Key
        );

// Result:
// 90 : { David, Bill }
// 91 : { Jane }
// 89 : { Tina }

Specifically, note that each IGrouping<int, string> is IEnumerable<string> and is not, say, List<string>. (It also has a .Key property.)

The GroupBy obviously has to fully enumerate the input items before it can emit a single grouping, however, since it does emit an IEnumerable<string> instead of a List<string>, there may be a performance benefit if you don't enumerate the entire grouping, such as if you just did .First().

Aside: technically, I suppose GroupBy could wait until you enumerate it to consume a single item from the input, then emit a single IGrouping, and only enumerate the rest of the input as the IGrouping is enumerated, collecting other groups into its internal data structure as it searches for the next item in the current group, but I find that an unlikely and problematic implementation, and expect that GroupBy will enumerate fully at invocation time.

Here's what code with First() would look like:

 var oneStudentForEachNumericScore = namesGroupedByScore
     .ToDictionary(
         grouping => grouping.Key,
         grouping => grouping.First() // does not fully enumerate the values
     );
 // Result:
 // 90 : David -- Bill is missing and we don't care
 // 91 : Jane
 // 89 : Tina

Now in Java Streams, to group, you have to collect, and you can't just give the groupingBy collector a second lambda for extracting the value. If you want a different value than the entire input, you have to map again (though note that the groupingBy collector lets you create multi-level groups of groups of ... of groups in one step). Here is equivalent code to the above C# code:

Map<Integer, List<String>> namesGroupedByScore = namesAndScores
      .entrySet().stream()
      .collect(Collectors.groupingBy(
          Map.Entry::getValue,
          Collectors.mapping(
              Map.Entry::getKey,
              Collectors.toList(),
          )
      ));

This seems less than optimal. So my questions are:

  1. Is there some way to express this more simply, without having to use Collectors.mapping to get the group items to be the value?
  2. Why do we have to collect to a fully enumerated type? Is there a way to simulate the IEnumerable value type of C#'s GroupBy and return a Map<Integer, Stream<String>> from Collectors.mapping(), or would that be of no use because the value items have to be enumerated fully, anyway? Or could we write our own Collectors.groupingBy that takes a lambda for the second argument and does the job for us, making the syntax closer to Linq's GroupBy and having at least cleaner syntax and possibly slightly improved performance?
  3. As a theoretical exercise, even if not practically useful, is it possible to write our own Java Stream Collector toStream() which returns a Stream and does NOT iterate its input until and unless it is enumerated (iterating one element at a time, deferred)?
like image 218
ErikE Avatar asked May 28 '18 21:05

ErikE


People also ask

How does collectors groupingBy work?

The groupingBy() method of Collectors class in Java are used for grouping objects by some property and storing results in a Map instance. In order to use it, we always need to specify a property by which the grouping would be performed. This method provides similar functionality to SQL's GROUP BY clause.

Which method of collector class can be used to store the result of a Stream in map?

The toMap collector can be used to collect Stream elements into a Map instance. To do this, we need to provide two functions: keyMapper.

Which method can be used to get the Stream object from any collection?

The collect() method of Stream class can be used to accumulate elements of any Stream into a Collection.

What does collectors do in Java?

A Collector is specified by four functions that work together to accumulate entries into a mutable result container, and optionally perform a final transform on the result. They are: creation of a new result container ( supplier() ) incorporating a new data element into a result container ( accumulator() )


2 Answers

While these operation look similar in some aspects, they are fundamentally different. Unlike Linq’s GroupBy operation, Java’s groupingBy is a Collector, designed to work with the terminal operation collect of the Stream API, which is a not an intermediate operation per se and hence, can’t be used to implement a lazy stream operation in general.

The groupingBy collector uses another downstream Collector for the groups, so instead of streaming over the group’s elements, to perform another operation, you would specify a collector performing that operation in-place, in the best case. While these collectors do not support short-circuiting, they eliminate the need to collect the groups into Lists, just to stream over them. Just consider, e.g. groupingBy(f1, summingInt(f2)). The case of collecting the groups into a List has been considered common enough to make toList() implied when you don’t specify a collector, but this hasn’t been considered for the case of mapping the elements before collecting into a list.

If you encounter this case often enough, it would be easy to define your own collector

public static <T,K,V> Collector<T,?,Map<K,List<V>>> groupingBy(
    Function<? super T, ? extends K> key, Function<? super T, ? extends V> value) {
    return Collectors.groupingBy(key, Collectors.mapping(value, Collectors.toList()));
}

and use it like

Map<Integer,List<String>> result = map.entrySet().stream()
    .collect(groupingBy(Map.Entry::getValue, Map.Entry::getKey));

and, since you are not required to use method references and want to be closer to the Linq original:

Map<Integer,List<String>> result = map.entrySet().stream()
        .collect(groupingBy(kvp -> kvp.getValue(), kvp -> kvp.getKey()));

but, as noted, if you are going to stream over this map afterwards and worrying about the non-laziness of this operation, you probably want to use a different collector than toList() anyway.

While this approach offers some flexibility regarding the resulting values, the Map and its keys are an unavoidable part of this operation, as not only does the Map provide the storage logic, its lookup operation is also responsible for forming the groups, which also determines the semantic. E.g. when you use the variant with a map supplier with () -> new TreeMap<>(customComparator) you may get entirely different groups as with the default HashMap (think of, e.g. String.CASE_INSENSITIVE_ORDER). On the other hand, when you supply an EnumMap, you may not get different semantics but entirely different performance characteristics.

In contrast, the GroupBy operation from Linq you described, looks like an intermediate operation which has no pendant in the Stream API at all. As you suggested yourself, chances are high that it still does a full traversal when the first element has been polled, completely populating a data structure behind the scenes. Even if an implementation tries some laziness, the results are limited. You could cheaply get the first element of the first group but if you’re only interested in that element, you wouldn’t need grouping at all. The second element of the first group might already be the last one of the source stream, requiring a full traversal and storage.

So offering such an operation would imply some complexity with little benefit over collecting eagerly. It’s also hard to imagine parallel capable implementation of it (offering benefits over the collect operation). The actual inconvenience stems not from this design decision, but from the fact that the resulting Map is not a Collection (note that implementing Iterable alone wouldn’t imply having a stream() method) and the decision to separate collection operations and stream operations. These two aspects lead to the requirement to use entrySet().stream() to stream over the map, but that’s outside the scope of this question. And, as said above, if you need this, check first whether a different downstream collector for the groupingBy collector couldn’t provide the desired result in the first place.

For completeness, here is a solution that tries to implement a lazy grouping:

public interface Group<K,V> {
    K key();
    Stream<V> values();
}
public static <T,K,V> Stream<Group<K,V>> group(Stream<T> s,
    Function<? super T, ? extends K> key, Function<? super T, ? extends V> value) {

    return StreamSupport.stream(new Spliterator<Group<K,V>>() {
        final Spliterator<T> sp = s.spliterator();
        final Map<K,GroupImpl<T,K,V>> map = new HashMap<>();
        ArrayDeque<Group<K,V>> pendingGroup = new ArrayDeque<>();
        Consumer<T> c;
        {
        c = t -> map.compute(key.apply(t), (k,g) -> {
            V v = value.apply(t);
            if(g == null) pendingGroup.addLast(g = new GroupImpl<>(k, v, sp, c));
            else g.add(v);
            return g;
        });
        }
        public boolean tryAdvance(Consumer<? super Group<K,V>> action) {
            do {} while(sp.tryAdvance(c) && pendingGroup.isEmpty());
            Group<K,V> g = pendingGroup.pollFirst();
            if(g == null) return false;
            action.accept(g);
            return true;
        }
        public Spliterator<Group<K,V>> trySplit() {
            return null; // that surely doesn't work in parallel
        }
        public long estimateSize() {
            return sp.estimateSize();
        }
        public int characteristics() {
            return ORDERED|NONNULL;
        }
    }, false);
}
static class GroupImpl<T,K,V> implements Group<K,V> {
    private final K key;
    private final V first;
    private final Spliterator<T> source;
    private final Consumer<T> sourceConsumer;
    private List<V> values;

    GroupImpl(K k, V firstValue, Spliterator<T> s, Consumer<T> c) {
        key = k;
        first = firstValue;
        source = s;
        sourceConsumer = c;
    }
    public K key() {
        return key;
    }
    public Stream<V> values() {
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<V>(1, Spliterator.ORDERED) {
            int pos;
            public boolean tryAdvance(Consumer<? super V> action) {
                if(pos == 0) {
                    pos++;
                    action.accept(first);
                    return true;
                }
                do {} while((values==null || values.size()<pos)
                           &&source.tryAdvance(sourceConsumer));
                if(values==null || values.size()<pos) return false;
                action.accept(values.get(pos++ -1));
                return true;
            }
        }, false);
    }
    void add(V value) {
        if(values == null) values = new ArrayList<>();
        values.add(value);
    }
}

You can test it with the following example:

group(
    Stream.of("foo", "bar", "baz", "hello", "world", "a", "b", "c")
          .peek(s -> System.out.println("source traversal: "+s)),
        String::length,
        String::toUpperCase)
    .filter(h -> h.values().anyMatch(s -> s.startsWith("B")))
    .findFirst()
    .ifPresent(g -> System.out.println("group with key "+g.key()));

which will print:

source traversal: foo
source traversal: bar
group with key 3

showing that the laziness works as far as possible. But

  • Each operation that requires to know all groups/keys needs a full traversal of the source, as the very last element could introduce a new group
  • Each operation that requires to process all elements of at least one group needs a full traversal, as the very last element of the source could belong to that group
  • The previous point even applies to short-circuiting operations, if they can’t stop early. E.g., in the example above finding a match in the second group implies a unsuccessful full traversal of the first group, hence a full traversal of the source
  • The example above could be rewritten to

    Stream.of("foo", "bar", "baz", "hello", "world", "a", "b", "c")
          .peek(s -> System.out.println("source traversal: "+s))
          .filter(s -> s.toUpperCase().startsWith("H"))
          .map(String::length)
          .findFirst()
          .ifPresent(key -> System.out.println("group with key "+key));
    

    which offers even better laziness (e.g. if the match is not within the first group).

    Of course, the example was contrived, but I have the strong feeling that almost any operation which bears the potential of lazy processing, i.e. does not need all groups and does not need all elements of at least one group, can be rewritten into an operation that doesn’t need grouping at all.

like image 94
Holger Avatar answered Sep 29 '22 14:09

Holger


Here are the solutions for part of your questions by StreamEx and my library AbacusUtil

Map<String, Integer> namesAndScores 
             = N.asMap("David", 90, "Jane", 91, "Bill", 90, "Tina", 89);

// By StreamEx
Map<Integer, List<String>> namesGroupedByScore = EntryStream.of(namesAndScores)
                                .invert().grouping();

// By AbacusUtil
Map<Integer, List<String>> namesGroupedByScore = EntryStream.of(namesAndScores)
                                   .groupTo(Fn.value(), Fn.key());
// Or
Map<Integer, Stream<String>> namesGroupedByScore2 = 
        EntryStream.of(namesAndScores).toMap(Fn.value(), collectingAndThen(mapping(Fn.key()), Stream::of));

If you only want to save the first name after group by:

Map<Integer, List<String>> namesAndScores3 = 
      EntryStream.of(namesAndScores).distinctByValue().groupTo(Fn.value(), Fn.key());
// Or
Map<Integer, String> namesAndScores4 = 
          EntryStream.of(namesAndScores).distinctByValue().toMap(Fn.value(), Fn.key());

If you want to save at most 2 values.

Map<Integer, List<String>> namesAndScores5 = EntryStream.of(namesAndScores).toMap(Fn.value(),
        MoreCollectors.mapping(Fn.key(), MoreCollectors.toList(2)));

For the rest of questions, I believe what Holger has said:"...but I have the strong feeling that almost any operation which bears the potential of lazy processing, i.e. does not need all groups and does not need all elements of at least one group, can be rewritten into an operation that doesn’t need grouping at all."

In any scenario, if groupBy is required, I don't think there exists such implementation without iterating all the elements, regardingless which language you're using. If iterating all the elements is unnecessary, most likely groupBy is unnecessary or misused.

like image 27
123-xyz Avatar answered Sep 29 '22 15:09

123-xyz