When collecting the elements of a stream into a set, is there any advantage (or drawback) to also specifying .distinct()
on the stream? For example:
return items.stream().map(...).distinct().collect(toSet());
Given that the set will already remove duplicates, this seems redundant, but does it offer any performance advantage or disadvantage? Does the answer depend on whether the stream is parallel/sequential or ordered/unordered?
According to the javadoc, distinct() is a stateful intermediate operation.

If you literally have .distinct() followed immediately by .collect(), it doesn't really add any benefit. Maybe if the .distinct() implementation is more performant than the Set's duplicate check you might get some benefit, but if you're collecting to a set you're going to end up with the same result anyway.

If, on the other hand, .distinct() occurs before your .map() operation, and that particular mapping is an expensive operation, you may get some gains there because you're processing less data overall.
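A rough sketch of that second case (Item, Result and expensiveTransform are made-up names here): deduplicating first means the costly mapping runs once per distinct element instead of once per element.

Set<Result> results = items.stream()
        .distinct()                          // drop duplicates before the expensive work
        .map(Item::expensiveTransform)       // hypothetical costly mapping
        .collect(Collectors.toSet());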
While you get the same result, they don't do the same thing: toSet() uses a HashSet, and you lose the initial ordering, which distinct() can preserve if required:
From the javadoc:
Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.
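As a sketch of that advice (items and the getName mapping are placeholder names), when your semantics allow it you can drop the ordering constraint before a parallel distinct():

List<String> names = items.parallelStream()
        .unordered()                  // no stable encounter order needed
        .map(Item::getName)           // placeholder mapping
        .distinct()                   // avoids the full-barrier buffering cost
        .collect(Collectors.toList());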
If you require stability, then distinct() is what you want; using toSet() afterwards would be useless (unless an API requires a Set). The combination is, however, useful if you have an equals implementing a partial equality:
import java.util.Objects;

class F {
    final int a;
    final int b;

    F(int a, int b) { this.a = a; this.b = b; }

    // Equality (and the hash) is based on 'a' only: a partial equality.
    @Override public int hashCode() { return Objects.hashCode(a); }

    @Override public boolean equals(Object other) {
        if (other == this) return true;
        if (!(other instanceof F)) return false;
        return a == ((F) other).a;
    }
}
If you have a = F(10, 1) and b = F(10, 2), they are equal, but not all their fields are equal. If the stream contains (a, b) in that order, distinct() keeps the first of the two equal instances and preserves the encounter order, while toSet() gives no guarantee about which instance survives or about iteration order. This however assumes some prerequisites (a sequential, ordered stream, etc.).
Note: this could also be done using a TreeSet and an appropriate compareTo method.
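To illustrate with the F class above (a small sketch, assuming a sequential stream over an ordered source and the usual java.util / java.util.stream imports):

F a = new F(10, 1);   // equal to b: same 'a' field
F b = new F(10, 2);
F c = new F(20, 3);

// Ordered, sequential source: keeps 'a' (the first of the equal pair)
// and preserves encounter order -> [a, c]
List<F> stable = Stream.of(a, b, c)
        .distinct()
        .collect(Collectors.toList());

// No guarantee about iteration order, nor about which of the two equal
// instances (a or b) ends up in the set.
Set<F> set = Stream.of(a, b, c)
        .collect(Collectors.toSet());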