I'm trying to collect a stream while throwing away rarely used items, as in this example:
import java.util.*;
import java.util.function.Function;
import static java.util.stream.Collectors.*;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsInAnyOrder;
import org.junit.Test;

@Test
public void shouldFilterCommonlyUsedWords() {
    // given
    List<String> allWords = Arrays.asList(
        "call", "feel", "call", "very", "call", "very", "feel", "very", "any");

    // when
    Set<String> commonlyUsed = allWords.stream()
            .collect(groupingBy(Function.identity(), counting()))
            .entrySet().stream().filter(e -> e.getValue() > 2)
            .map(Map.Entry::getKey).collect(toSet());

    // then
    assertThat(commonlyUsed, containsInAnyOrder("call", "very"));
}
I have a feeling that it is possible to do this much more simply - am I right?
The groupingBy() method of the Collectors class groups objects by some property and stores the results in a Map instance. To use it, we need to specify the property by which the grouping is performed. It provides functionality similar to SQL's GROUP BY clause.
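For readers unfamiliar with this collector, here is a minimal sketch of groupingBy() combined with counting() (the class name is illustrative):

```java
import java.util.*;
import java.util.function.Function;
import static java.util.stream.Collectors.*;

public class GroupingByDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("call", "feel", "call", "very");
        // Group equal words together and count the occurrences of each
        Map<String, Long> counts = words.stream()
            .collect(groupingBy(Function.identity(), counting()));
        System.out.println(counts.get("call")); // 2
        System.out.println(counts.get("feel")); // 1
    }
}
```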
There is no way around creating a Map, unless you are willing to accept a very high CPU cost. However, you can remove the second collect operation:
Map<String, Long> map = allWords.stream()
        .collect(groupingBy(Function.identity(), HashMap::new, counting()));
map.values().removeIf(l -> l <= 2);
Set<String> commonlyUsed = map.keySet();
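The in-place filtering above works because Map.values() is a modifiable view backed by the map, so removing a value removes its whole entry. A minimal illustration (class name is illustrative):

```java
import java.util.*;

public class ValuesViewDemo {
    public static void main(String[] args) {
        Map<String, Long> map = new HashMap<>();
        map.put("call", 3L);
        map.put("feel", 2L);
        map.put("any", 1L);
        // values() is a live view of the map: removing values removes whole entries,
        // which is also reflected in keySet()
        map.values().removeIf(l -> l <= 2);
        System.out.println(map.keySet()); // [call]
    }
}
```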
Note that in Java 8, HashSet still wraps a HashMap, so using the keySet() of a HashMap when you want a Set in the first place doesn't waste space given the current implementation.
Of course, you can hide the post-processing in a Collector if that feels more "streamy":
Set<String> commonlyUsed = allWords.stream()
    .collect(collectingAndThen(
        groupingBy(Function.identity(), HashMap::new, counting()),
        map -> { map.values().removeIf(l -> l <= 2); return map.keySet(); }));
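As a quick sanity check, the one-collector version could be exercised on the question's data like this (class name is illustrative):

```java
import java.util.*;
import java.util.function.Function;
import static java.util.stream.Collectors.*;

public class CollectingAndThenDemo {
    public static void main(String[] args) {
        List<String> allWords = Arrays.asList(
            "call", "feel", "call", "very", "call", "very", "feel", "very", "any");
        // collectingAndThen applies a finisher function to the fully built Map
        Set<String> commonlyUsed = allWords.stream()
            .collect(collectingAndThen(
                groupingBy(Function.identity(), HashMap::new, counting()),
                map -> { map.values().removeIf(l -> l <= 2); return map.keySet(); }));
        System.out.println(commonlyUsed.equals(
            new HashSet<>(Arrays.asList("call", "very")))); // true
    }
}
```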
A while ago I wrote an experimental distinct(atLeast) method for my library:
public StreamEx<T> distinct(long atLeast) {
    if (atLeast <= 1)
        return distinct();
    AtomicLong nullCount = new AtomicLong();
    ConcurrentHashMap<T, Long> map = new ConcurrentHashMap<>();
    return filter(t -> {
        if (t == null) {
            return nullCount.incrementAndGet() == atLeast;
        }
        return map.merge(t, 1L, (u, v) -> (u + v)) == atLeast;
    });
}
So the idea was to use it like this:
Set<String> commonlyUsed = StreamEx.of(allWords).distinct(3).toSet();
This performs a stateful filtration, which looks a little bit ugly. I doubted whether such a feature would be useful, so I did not merge it into the master branch. Nevertheless, it does the job in a single stream pass. Probably I should revive it. Meanwhile, you can copy this code into a static method and use it like this:
Set<String> commonlyUsed = distinct(allWords.stream(), 3).collect(Collectors.toSet());
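The suggested static-method adaptation for plain streams might look like the following sketch (the free-standing distinct helper and class name are my assumptions, not part of StreamEx):

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.*;

public class DistinctAtLeast {
    // Static adaptation of the answer's stateful filter for a plain Stream
    public static <T> Stream<T> distinct(Stream<T> stream, long atLeast) {
        if (atLeast <= 1)
            return stream.distinct();
        AtomicLong nullCount = new AtomicLong();
        ConcurrentHashMap<T, Long> map = new ConcurrentHashMap<>();
        return stream.filter(t -> {
            if (t == null)
                return nullCount.incrementAndGet() == atLeast;
            // an element passes the filter exactly once: on its atLeast-th occurrence
            return map.merge(t, 1L, Long::sum) == atLeast;
        });
    }

    public static void main(String[] args) {
        List<String> allWords = Arrays.asList(
            "call", "feel", "call", "very", "call", "very", "feel", "very", "any");
        Set<String> commonlyUsed = distinct(allWords.stream(), 3)
            .collect(Collectors.toSet());
        // contains exactly "call" and "very" (iteration order of the set is unspecified)
        System.out.println(commonlyUsed.size()); // 2
    }
}
```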
Update (2015/05/31): I added the distinct(atLeast) method to StreamEx 0.3.1. It's implemented using a custom spliterator. Benchmarks showed that this implementation is significantly faster for sequential streams than the stateful filtering described above, and in many cases it's also faster than the other solutions proposed in this topic. It also works nicely if null is encountered in the stream (the groupingBy collector doesn't support null keys, so groupingBy-based solutions will fail if null is encountered).