
stream().collect(Collectors.toSet()) vs stream().distinct().collect(Collectors.toList())

I have a list of about 200 objects, of which only about 20 are unique, and I want to keep only the unique values. Between list.stream().collect(Collectors.toSet()) and list.stream().distinct().collect(Collectors.toList()), which is more efficient with respect to latency and memory consumption?

Laxmikant asked Feb 26 '18 17:02




2 Answers

The practical answer is fairly obvious: with so few elements, don't worry about speed or memory consumption, and keep in mind that one version returns a Set and the other a List. Still, there are some small details that I find interesting.

Suppose you are streaming from a source that is already known to be distinct (a Set, for example). In that case the .distinct() operation is a no-op, because there is nothing to do.
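A stream learns that its source is distinct from the source's spliterator characteristics. The sketch below only inspects that flag; it does not prove what .distinct() does internally. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Spliterator;

class DistinctCharacteristicDemo {

    // True when the collection's spliterator advertises that it never yields duplicates.
    static boolean reportsDistinct(Collection<?> source) {
        return source.spliterator().hasCharacteristics(Spliterator.DISTINCT);
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "a"));
        Set<String> set = new HashSet<>(list);

        System.out.println("ArrayList reports DISTINCT: " + reportsDistinct(list)); // false
        System.out.println("HashSet reports DISTINCT:   " + reportsDistinct(set));  // true
    }
}
```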

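If you are streaming from a List (which is ordered by design) and no intermediate operation (unordered(), for example) relaxes the ordering, .distinct() is forced to preserve the encounter order. In the OpenJDK implementation, a parallel ordered pipeline does this by collecting into a LinkedHashSet internally, which is pretty expensive; a sequential pipeline gets by with a plain HashSet of the elements seen so far.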
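The ordering guarantee is easy to observe from the outside. This is a minimal sketch with made-up sample data and hypothetical method names: distinct() on a List-backed stream keeps the first occurrence of each element in encounter order, while Collectors.toSet() promises no particular order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class EncounterOrderDemo {

    // Unique elements as a List, first occurrences kept in encounter order.
    static List<String> uniqueInOrder(List<String> input) {
        return input.stream().distinct().collect(Collectors.toList());
    }

    // Unique elements as a Set; iteration order is unspecified.
    static Set<String> uniqueAsSet(List<String> input) {
        return input.stream().collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "pear", "fig", "apple");

        System.out.println(uniqueInOrder(input)); // [pear, apple, fig]
        System.out.println(uniqueAsSet(input));   // same three elements, order not guaranteed
    }
}
```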

If you are doing parallel processing, the list.stream().collect(Collectors.toSet()) version will merge multiple HashSets (Java 9 slightly improved this compared with Java 8). .distinct(), on the other hand, will, when the stream is unordered, spin up a ConcurrentHashMap that keeps all the keys with a dummy Boolean.TRUE value. It also does something interesting to preserve a null that your stream might contain, and even that is handled differently internally in the two cases.
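As a rough illustration of the parallel case, the sketch below runs both variants on a parallel stream that contains a null. The names are hypothetical, and the code shows only the observable result, not which internal data structure a given JDK picks. unordered() is what lifts the ordering constraint on distinct().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ParallelDistinctDemo {

    // Parallel, explicitly unordered: distinct() no longer has to preserve encounter order.
    static List<String> uniqueUnordered(List<String> input) {
        return input.parallelStream().unordered().distinct().collect(Collectors.toList());
    }

    // Parallel collection into a Set; partial results are merged by the collector.
    static Set<String> uniqueAsSet(List<String> input) {
        return input.parallelStream().collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", null, "b", "a", null, "b");

        List<String> viaDistinct = uniqueUnordered(input);
        Set<String> viaToSet = uniqueAsSet(input);

        // Both keep exactly one null and one copy of each value; order is unspecified.
        System.out.println(viaDistinct.size() + " elements via distinct(): " + viaDistinct);
        System.out.println(viaToSet.size() + " elements via toSet(): " + viaToSet);
    }
}
```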

Eugene answered Jan 01 '23 21:01



A Set (typically a HashSet) consumes more memory than a List (typically an ArrayList), mainly because of the hash table it maintains. But with so few elements, you will not notice a difference in memory consumption.
What you should care about instead is that these collectors return different things: a List and a Set, each with its own characteristics, particularly in how you access their elements.
So choose the one that matches what you want to do with the collection.
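To make the access difference concrete, here is a small sketch with hypothetical helper names and sample data: the List result supports positional access, while the Set result is the natural fit for membership checks.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ListVersusSetAccessDemo {

    // Positional access: only meaningful on the List result.
    static String firstUnique(List<String> input) {
        List<String> unique = input.stream().distinct().collect(Collectors.toList());
        return unique.get(0);
    }

    // Membership test: what a Set is designed for.
    static boolean containsValue(List<String> input, String value) {
        Set<String> unique = input.stream().collect(Collectors.toSet());
        return unique.contains(value);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "pear", "fig");

        System.out.println(firstUnique(input));            // pear
        System.out.println(containsValue(input, "fig"));   // true
        System.out.println(containsValue(input, "plum"));  // false
    }
}
```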

davidxxx answered Jan 01 '23 22:01
