Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it worth using distinct() with collect(toSet())

When collecting the elements of a stream into a set, is there any advantage (or drawback) to also specifying .distinct() on the stream? For example:

return items.stream().map(...).distinct().collect(toSet());

Given that the set will already remove duplicates, this seems redundant, but does it offer any performance advantage or disadvantage? Does the answer depend on whether the stream is parallel/sequential or ordered/unordered?

like image 357
Neil Madden Avatar asked Jan 11 '17 14:01

Neil Madden


People also ask

How do I get unique values from a collection stream?

distinct() returns a stream consisting of distinct elements in a stream. distinct() is the method of Stream interface. This method uses hashCode() and equals() methods to get distinct elements.

How to find duplicate values in hashmap in Java 8?

In Java 8 Stream, filter with Set. Add() is the fastest algorithm to find duplicate elements, because it loops only one time. Set<T> items = new HashSet<>(); return list. stream() .

How to find duplicate elements in List using stream Java?

Get the stream of elements in which the duplicates are to be found. For each element in the stream, count the frequency of each element, using Collections. frequency() method. Then for each element in the collection list, if the frequency of any element is more than one, then this element is a duplicate element.


2 Answers

According to the javadoc, distinct is a stateful intermediate operation.

If you literally have .distinct followed immediately by .collect, it doesn't really add any benefit. Maybe if the .distinct implementation is more performant than the Set duplication check, you might get some benefit, but if you're collecting to a set you're going to end up with the same result anyway.

If, on the other hand, .distinct occurs before your .map operation, and that particular mapping is an expensive operation, you may get some gains there because you're processing less data overall.

like image 171
Steve Chaloner Avatar answered Sep 21 '22 12:09

Steve Chaloner


While you have the same result, they don't do the same thing: toSet() use HashSet, and you lose the initial ordering which is what distinct can preserve if required:

From the javadoc:

Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.

If you require stability, then it is distinct(). Using toSet() after would be useless (if not required by an API).

That is however useful if you have an equals implementing a partial equality:

class F {
  int a;
  int b;
  @Override int hashCode() {return Objects.hashCode(a);}
  @Override boolean equals(Object other) {
    if (other == this) return true;
    if (!(other instanceof F)) return false;
    return a == ((F)other).a;
  }
}

If you have a = F(10, 1) and b = F(10, 2) they are equals. But not all their fields are equals.

If in the list you have (b, a)

  • With toSet() you won't always have this order. You might have (b, a), etc.
  • With distinct() you preserve this information, eg: (b, a).

This however assume some prerequisites (sequential, etc).

Note: this could be done using a TreeSet and appropriate compareTo method.

like image 27
NoDataFound Avatar answered Sep 21 '22 12:09

NoDataFound