java 8, most efficient method to return duplicates from a list (not remove them)? [duplicate]


Tags: java, list, optimization, duplicates, java-stream

I have an ArrayList of Strings, and I want to find and return all values which exist more than once in the list. Most cases are looking for the opposite (removing the duplicate items like distinct()), and so example code is hard to come by.

I was able to come up with this:

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for Collections.frequency(): "+ elapsedTime);

    return duplicates;
}

But this uses Collections.frequency, which loops through the whole list for each item and counts every occurrence. This takes about 150ms to run on my current list of about 4,000 strings. This is a bit slow for me and will only get worse as the list size increases. I took the frequency method and rewrote it to return immediately on the 2nd occurrence:

protected boolean moreThanOne(Collection<?> c, Object o) {
    boolean found = false;
    if (o != null) {
        for (Object e : c) {
            if (o.equals(e)) {
                if (found) {
                    return found;
                } else {
                    found = true;
                }
            }
        }
    }
    return found;
}

and changed my method to use it:

public synchronized List<String> listMatching(List<String> allStrings)   {
    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for moreThanOne(): "+ elapsedTime);

    return duplicates;
}

This seems to work as expected, but does not really increase the speed as much as I was hoping, clocking in at about 120ms. This is probably due to it also needing to loop through the whole list for each item, but I am not sure how to avoid that and still accomplish the task.

I know this might seem like premature optimization, but my List can easily be 1mil+, and this method is a critical piece of my app that influences the timing of everything else.

Do you see any way that I could further optimize this code? Perhaps using some sort of fancy Predicate? An entirely different approach?

EDIT: Thanks to all your suggestions, I was able to come up with something significantly faster:

public synchronized Set<String> listMatching(List<String> allStrings) {

    Set<String> allItems = new HashSet<>();
    Set<String> duplicates = allStrings.stream()
            .filter(string -> !allItems.add(string))
            .collect(Collectors.toSet());

    return duplicates;
}

Running under the same conditions, this is able to go through my list in <5ms. All the HashMap suggestions would have been great though, if I had needed to know the counts. Not sure why the Collections.frequency() method doesn't use that technique.
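For anyone comparing the two versions, here is a minimal side-by-side sketch. The class name `DuplicateComparison` and the method names `viaFrequency` and `viaSet` are made up for illustration, and the sample data is arbitrary. It shows that the two approaches do not return the same thing: the frequency-based filter keeps every occurrence of a repeated value, while the Set-based filter yields each repeated value once. Note that the Set-based lambda mutates `seen`, so it is only safe on a sequential stream.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DuplicateComparison {

    // Original approach: scans the whole list once per element (quadratic),
    // and keeps every occurrence of a value that appears more than once.
    static List<String> viaFrequency(List<String> allStrings) {
        return allStrings.stream()
                .filter(s -> Collections.frequency(allStrings, s) > 1)
                .collect(Collectors.toList());
    }

    // Set-based approach: one pass with constant-time lookups on average,
    // and each repeated value appears once in the result.
    static Set<String> viaSet(List<String> allStrings) {
        Set<String> seen = new HashSet<>();
        return allStrings.stream()
                .filter(s -> !seen.add(s)) // add() returns false if s was already seen
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(viaFrequency(sample)); // [a, b, a, b, a]
        System.out.println(viaSet(sample));       // a set containing a and b
    }
}
```

If callers still need a `List`, wrapping the result in `new ArrayList<>(...)` is probably the simplest adjustment.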

asked Aug 23 '17 16:08

Jonathon Hoaglin

3 Answers

An easy way to find duplicates is to iterate over the list and use the add() method to add the item to some other temp set. It will return false if the item already exists in the set.

public synchronized Set<String> listMatching(List<String> allStrings) {
   Set<String> tempSet = new HashSet<>();
   Set<String> duplicates = new HashSet<>();

   allStrings.forEach( item -> {
       // add() returns false when the item is already in tempSet
       if (!tempSet.add(item)) duplicates.add(item);
   });

   return duplicates;
}

answered Sep 21 '22 05:09

dspano

A good way to make this really scalable is to build a Map that contains the count of each string. To build the map, you will look up each string in your list. If the string is not yet in the map, put the string and a count of one in the map. If the string is found in the map, increment the count.

You probably want to use some type that allows you to increment the count in-place, rather than having to put() the updated count each time. For example, you can use an int[] with one element.

The other advantage of not re-putting counts is that it is easy to execute in parallel, because you can synchronize on the object that contains your count when you want to read/write the count.
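A rough sketch of what a parallel version could look like. Note that this swaps in a different technique from the one described above: instead of synchronizing on an `int[]`, it uses `ConcurrentHashMap` with `LongAdder` counters from `java.util.concurrent`. The class name `ParallelCounts` and its method names are hypothetical, and the sketch assumes the list contains no `null` elements, since `ConcurrentHashMap` rejects null keys.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

class ParallelCounts {

    // Counts occurrences of each string using a parallel stream.
    // computeIfAbsent is atomic on ConcurrentHashMap, and LongAdder
    // supports concurrent increments without explicit locking.
    static Map<String, LongAdder> countAll(List<String> strings) {
        ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
        strings.parallelStream()
               .forEach(s -> counts.computeIfAbsent(s, k -> new LongAdder()).increment());
        return counts;
    }

    // Returns each string that occurs more than once.
    static Set<String> duplicates(List<String> strings) {
        return countAll(strings).entrySet().stream()
                .filter(e -> e.getValue().sum() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```

Whether this actually beats the sequential loop depends on the list size and the hardware, so it is worth measuring before adopting it.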

The non-parallel code might look something like this:

Map<String, int[]> map = new HashMap<>(listOfStrings.size());
for (String s: listOfStrings) {
    int[] curCount = map.get(s);
    if (curCount == null) {
        curCount = new int[1];
        curCount[0] = 1;
        map.put(s, curCount);
    } else {
        curCount[0]++;
    }
}

Then you can iterate over the map entries and do the right thing based on the count of each string.
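As an illustration of that last step, here is a minimal sketch that wraps the counting loop in a method and then collects every string whose count is greater than one. The class name `CountDuplicates` and the method name `duplicatesOf` are made up for this example, and "count greater than one" is just one possible reading of "the right thing".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountDuplicates {

    static List<String> duplicatesOf(List<String> listOfStrings) {
        // Pass 1: count occurrences, incrementing the int[] cell in place.
        Map<String, int[]> map = new HashMap<>(listOfStrings.size());
        for (String s : listOfStrings) {
            int[] curCount = map.get(s);
            if (curCount == null) {
                map.put(s, new int[] {1});
            } else {
                curCount[0]++;
            }
        }

        // Pass 2: keep the keys that were seen more than once.
        // HashMap iteration order is unspecified, so the result order is too.
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : map.entrySet()) {
            if (entry.getValue()[0] > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }
}
```

Each duplicated string appears once in the result; the counts stay available in the map if they are needed later.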

answered Sep 21 '22 05:09

Rob

The best data structure for this is a Set<String>.

Add all elements from the list to the set.

Then traverse the list and remove each element from the set, one by one.

If an element is no longer found in the set, it is a duplicate in the list (its first occurrence has already removed it).

This takes O(n) + O(n) time, which is O(n) overall.

Code:

    List<String> list = new ArrayList<>();
    List<String> duplicates = new ArrayList<>();

    list.add("luna");
    list.add("mirana");
    list.add("mirana");
    list.add("mirana");

    Set<String> set = new HashSet<>();
    set.addAll(list);
    for (String a : list) {
        // remove() returns false once the first occurrence has already removed a
        if (!set.remove(a)) {
            duplicates.add(a);
        }
    }
    System.out.println(duplicates);

Output

[mirana, mirana]

answered Sep 21 '22 05:09

nagendra547
