Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating distinct list from existing list in Java 7 and 8?

If I have:

List<Integer> listInts = { 1, 1, 3, 77, 2, 19, 77, 123, 14, 123... }

in Java what is an efficient way of creating a List<Integer> listDistinctInts containing only the distinct values from listInts?

My immediate thought is to create a Set<Integer> setInts containing all the values from listInts then call List<Integer> listDistinctInts = new ArrayList<>(setInts);

But this seems potentially inefficient - is there a better solution using Java 7?

I'm not using Java 8, but I believe using it I could do something like this(?):

List<Integer> listDistinctInts = listInts.stream().distinct().collect(Collectors.toList());

Would this be more performant than the approach above and/or is there any more efficient way of doing this in Java 8?

Finally, (and I'm aware that asking multiple questions might be frowned upon but it's directly related) if I only cared about the count of distinct elements in listInts is there a more efficient way to get that value (in Java 7 and 8) - without first creating a list or set of all the distinct elements?

I'm most interested in native Java ways of accomplishing this and avoiding re-inventing any wheels but would consider hand-rolled code or libraries if they offer better clarity or performance. I've read this related question Java - Distinct List of Objects but it's not entirely clear about the differences in performance between the Java 7 and 8 approaches or whether there might be better techniques?

like image 887
Matt Coubrough Avatar asked Dec 13 '14 23:12

Matt Coubrough


People also ask

How do you make a List distinct in Java?

We'll use the distinct() method from the Stream API, which returns a stream consisting of distinct elements based on the result returned by the equals() method. There we have it, three quick ways to clean up all the duplicate items from a List.

What is the use of distinct () in Java 8?

Java Stream distinct() method returns a new stream of distinct elements. It's useful in removing duplicate elements from the collection before processing them.

How do I get unique values from a collection stream?

distinct() returns a stream consisting of distinct elements in a stream. distinct() is the method of Stream interface. This method uses hashCode() and equals() methods to get distinct elements.

How do I convert a List from one List to another in Java?

Another approach to copying elements is using the addAll method: List<Integer> copy = new ArrayList<>(); copy. addAll(list); It's important to keep in mind whenever using this method that, as with the constructor, the contents of both lists will reference the same objects.


4 Answers

I've now MicroBenchmarked most of the proposed options from the excellent answers provided. Like most non-trivial performance related questions the answer as to which is best is "it depends".

All my testing was performed with JMH Java Microbenchmarking Harness.

Most of these tests were performed using JDK 1.8, although I performed some of the tests with JDK 1.7 too just to ensure that its performance wasn't too different (it was almost identical). I tested the following techniques taken from the answers supplied so far:


1. Java 8 Stream - The solution using stream() I had proprosed as a possibility if using Java8:

public List<Integer> testJava8Stream(List<Integer> listInts) {     return listInts.stream().distinct().collect(Collectors.toList()); } 

pros modern Java 8 approach, no 3rd party dependencies

cons Requires Java 8


2. Adding To List - The solution proposed by Victor2748 where a new list is constructed and added to, if and only if the list doesn't already contain the value. Note that I also preallocate the destination list at the size of the original (the max possible) to prevent any reallocations:

public List<Integer> testAddingToList(List<Integer> listInts) {     List<Integer> listDistinctInts = new ArrayList<>(listInts.size());     for(Integer i : listInts)     {         if( !listDistinctInts.contains(i) ) { listDistinctInts.add(i); }     }     return listDistinctInts; } 

pros Works in any Java version, no need to create a Set and then copy, no 3rd party deps

cons Needs to repeatedly check the List for existing values as we build it


3. GS Collections Fast (now Eclipse collections) - The solution proposed by Craig P. Motlin using the GS Collections library and their custom List type FastList:

public List<Integer> testGsCollectionsFast(FastList listFast) {     return listFast.distinct(); } 

pros Reportedly very quick, simple expressive code, works in Java 7 and 8

cons Requires 3rd party library and a FastList rather than a regular List<Integer>


4. GS Collections Adapted - The FastList solution wasn't quite comparing like-for-like because it needed a FastList passed to the method rather than a good ol' ArrayList<Integer> so I also tested the adapter method Craig proposed:

public List<Integer> testGsCollectionsAdapted(List<Integer> listInts) {     return listAdapter.adapt(listInts).distinct(); } 

pros Doesn't require a FastList, works in Java 7 and 8

cons Has to adapt List so may not perform as well, needs 3rd party library


5. Guava ImmutableSet - The method proposed by Louis Wasserman in comments, and by 卢声远 Shengyuan Lu in their answer using Guava:

public List<Integer> testGuavaImmutable(List<Integer> listInts) {     return ImmutableSet.copyOf(listInts).asList(); } 

pros Reportedly very fast, works in Java 7 or 8

cons Returns an Immutable List, can't handle nulls in the input List, and requires 3rd party library


7. HashSet - My original idea (also recommended by EverV0id, ulix and Radiodef)

public List<Integer> testHashSet(List<Integer> listInts) {     return new ArrayList<Integer>(new HashSet<Integer>(listInts)); } 

pros Works in Java 7 and 8, no 3rd party dependencies

cons Doesn't retain original order of list, has to construct set then copy to list.


6. LinkedHashSet - Since the HashSet solution didn't preserve the order of the Integers in the original list I also tested a version which uses LinkedHashSet to preserve order:

public List<Integer> testLinkedHashSet(List<Integer> listInts) {     return new ArrayList<Integer>(new LinkedHashSet<Integer>(listInts)); } 

pros Retains original ordering, works in Java 7 and 8, no 3rd party dependencies

cons Unlikely to be as fast as regular HashSet approach


Results

Here are my results for various different sizes of listInts (results ordered from slowest to fastest):

1. taking distinct from ArrayList of 100,000 random ints between 0-50,000 (ie. big list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units  AddingToList            thrpt        10        0.505        0.012    ops/s Java8Stream             thrpt        10      234.932       31.959    ops/s LinkedHashSet           thrpt        10      262.185       16.679    ops/s HashSet                 thrpt        10      264.295       24.154    ops/s GsCollectionsAdapted    thrpt        10      357.998       18.468    ops/s GsCollectionsFast       thrpt        10      363.443       40.089    ops/s GuavaImmutable          thrpt        10      469.423       26.056    ops/s 

2. taking distinct from ArrayList of 1000 random ints between 0-50 (ie. medium list, many duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units  AddingToList            thrpt        10    32794.698     1154.113    ops/s HashSet                 thrpt        10    61622.073     2752.557    ops/s LinkedHashSet           thrpt        10    67155.865     1690.119    ops/s Java8Stream             thrpt        10    87440.902    13517.925    ops/s GsCollectionsFast       thrpt        10   103490.738    35302.201    ops/s GsCollectionsAdapted    thrpt        10   143135.973     4733.601    ops/s GuavaImmutable          thrpt        10   186301.330    13421.850    ops/s 

3. taking distinct from ArrayList of 100 random ints between 0-100 (ie. small list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units  AddingToList            thrpt        10   278435.085    14229.285    ops/s Java8Stream             thrpt        10   397664.052    24282.858    ops/s LinkedHashSet           thrpt        10   462701.618    20098.435    ops/s GsCollectionsAdapted    thrpt        10   477097.125    15212.580    ops/s GsCollectionsFast       thrpt        10   511248.923    48155.211    ops/s HashSet                 thrpt        10   512003.713    25886.696    ops/s GuavaImmutable          thrpt        10  1082006.560    18716.012    ops/s 

4. taking distinct from ArrayList of 10 random ints between 0-50 (ie. tiny list, few duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units  Java8Stream             thrpt        10  2739774.758   306124.297    ops/s LinkedHashSet           thrpt        10  3607479.332   150331.918    ops/s HashSet                 thrpt        10  4238393.657   185624.358    ops/s GsCollectionsAdapted    thrpt        10  5919254.755   495444.800    ops/s GsCollectionsFast       thrpt        10  7916079.963  1708778.450    ops/s AddingToList            thrpt        10  7931479.667   966331.036    ops/s GuavaImmutable          thrpt        10  9021621.880   845936.861    ops/s 

Conclusions

  • If you're only taking the distinct items from a list once, and the list isn't very long any of these methods should be adequate.

  • The most efficient general approaches came from the 3rd party libraries: GS Collections and Guava performed admirably.

  • You may need to consider the size of your list and the likely number of duplicates when selecting the most performant method.

  • The naive approach of adding to a new list only if the value isn't already in it works great for tiny lists, but as soon as you have more than a handful of values in the input list it performs the worst of the methods tried.

  • The Guava ImmutableSet.copyOf(listInts).asList() method works the fastest in most situations. But take note of the restrictions: the returned list is Immutable and the input list cannot contain nulls.

  • The HashSet method performs the best of the non 3rd party approaches and usually better than Java 8 streams, but reorders the integers (which may or may not be an issue depending on your use-case).

  • The LinkedHashSet approach keeps the ordering but unsurprisingly was usually worse than the HashSet method.

  • Both the HashSet and LinkedHashSet methods will perform worse when using lists of data types that have complex HashCode calculations, so do your own profiling if you're trying to select distinct Foos from a List<Foo>.

  • If you already have GS Collections as a dependency then it performs very well and is more flexible than the ImmutableList Guava approach. If you don't have it as a dependency, it's worth considering adding it if the performance of selecting distinct items is critical to the performance of your application.

  • Disappointingly Java 8 streams seemed to perform fairly poorly. There may be a better way to code the distinct() call than the way I used, so comments or other answers are of course welcome.

NB. I'm no expert at MicroBenchmarking, so if anyone finds flaws in my results or methodology please notify me and I'll endeavour to correct the Answer.

like image 101
Matt Coubrough Avatar answered Sep 30 '22 12:09

Matt Coubrough


If you're using Eclipse Collections (formerly GS Collections), you can use the method distinct().

ListIterable<Integer> listInts = FastList.newListWith(1, 1, 3, 77, 2, 19, 77, 123, 14, 123); Assert.assertEquals(         FastList.newListWith(1, 3, 77, 2, 19, 123, 14),         listInts.distinct()); 

The advantage of using distinct() instead of converting to a Set and then back to a List is that distinct() preserves the order of the original List, retaining the first occurrence of each element. It's implemented by using both a Set and a List.

MutableSet<T> seenSoFar = UnifiedSet.newSet(); int size = list.size(); for (int i = 0; i < size; i++) {     T item = list.get(i);     if (seenSoFar.add(item))     {         targetCollection.add(item);     } } return targetCollection; 

If you cannot convert your original List into a GS Collections type, you can use ListAdapter to get the same API.

MutableList<Integer> distinct = ListAdapter.adapt(integers).distinct(); 

There's no way to avoid the creation of the Set. Still, UnifiedSet is more efficient than HashSet so there will be some speed benefit.

If all you want is the number of distinct items, it's more efficient to just create a set without creating the list.

Verify.assertSize(7, UnifiedSet.newSet(listInts)); 

Eclipse Collections 8.0 requires Java 8. Eclipse Collections 7.x works well with Java 8, but only requires Java 5.

Note: I am a committer for Eclipse collections.

like image 25
Craig P. Motlin Avatar answered Sep 30 '22 11:09

Craig P. Motlin


You should try new LinkedList(new HashSet(listInts)).

like image 34
Everv0id Avatar answered Sep 30 '22 11:09

Everv0id


Guava can be your choice:

ImmutableSet<Integer> set = ImmutableSet.copyOf(listInts);

The API is extremely optimized.

It is FASTER than listInts.stream().distinct() and new LinkedHashSet<>(listInts).

like image 30
卢声远 Shengyuan Lu Avatar answered Sep 30 '22 13:09

卢声远 Shengyuan Lu