When collecting the elements of a stream into a set, is there any advantage (or drawback) to also specifying .distinct()
on the stream? For example:
return items.stream().map(...).distinct().collect(toSet());
Given that the set will already remove duplicates, this seems redundant, but does it offer any performance advantage or disadvantage? Does the answer depend on whether the stream is parallel/sequential or ordered/unordered?
According to the javadoc, distinct() is a stateful intermediate operation.

If you literally have .distinct() followed immediately by .collect(), it doesn't really add any benefit. Maybe if the .distinct() implementation is more performant than the Set's duplicate check you might get some benefit, but if you're collecting to a set you're going to end up with the same result anyway.

If, on the other hand, .distinct() occurs before your .map() operation, and that particular mapping is an expensive operation, you may get some gains there because you're processing less data overall.
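A rough sketch of that second case (Item, Result and expensiveTransform are made-up names here): deduplicating first means the costly mapping runs once per distinct element instead of once per element.

Set<Result> results = items.stream()
        .distinct()                          // drop duplicates before the expensive work
        .map(Item::expensiveTransform)       // hypothetical costly mapping
        .collect(Collectors.toSet());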
While you get the same result, they don't do the same thing: toSet() uses a HashSet, and you lose the initial ordering, which distinct() can preserve if required:
From the javadoc:
Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.
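As a sketch of that advice (items and the getName mapping are placeholder names), when your semantics allow it you can drop the ordering constraint before a parallel distinct():

List<String> names = items.parallelStream()
        .unordered()                  // no stable encounter order needed
        .map(Item::getName)           // placeholder mapping
        .distinct()                   // avoids the full-barrier buffering cost
        .collect(Collectors.toList());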
If you require stability, then distinct() is what you want; using toSet() afterwards would be useless (unless an API requires a Set). The combination is, however, useful if you have an equals implementing a partial equality:
import java.util.Objects;

class F {
    final int a;
    final int b;

    F(int a, int b) { this.a = a; this.b = b; }

    // Equality (and the hash) is based on 'a' only: a partial equality.
    @Override public int hashCode() { return Objects.hashCode(a); }

    @Override public boolean equals(Object other) {
        if (other == this) return true;
        if (!(other instanceof F)) return false;
        return a == ((F) other).a;
    }
}
If you have a = F(10, 1) and b = F(10, 2), they are equal, but not all their fields are equal. If the stream contains (a, b) in that order, distinct() keeps the first of the two equal instances and preserves the encounter order, while toSet() gives no guarantee about which instance survives or about iteration order. This however assumes some prerequisites (a sequential, ordered stream, etc.).
Note: this could also be done using a TreeSet and an appropriate compareTo method.
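To illustrate with the F class above (a small sketch, assuming a sequential stream over an ordered source and the usual java.util / java.util.stream imports):

F a = new F(10, 1);   // equal to b: same 'a' field
F b = new F(10, 2);
F c = new F(20, 3);

// Ordered, sequential source: keeps 'a' (the first of the equal pair)
// and preserves encounter order -> [a, c]
List<F> stable = Stream.of(a, b, c)
        .distinct()
        .collect(Collectors.toList());

// No guarantee about iteration order, nor about which of the two equal
// instances (a or b) ends up in the set.
Set<F> set = Stream.of(a, b, c)
        .collect(Collectors.toSet());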