
Java 8: most efficient method to return duplicates from a list (not remove them)? [duplicate]

I have an ArrayList of Strings, and I want to find and return all values that occur more than once in the list. Most examples deal with the opposite problem (removing duplicates, as distinct() does), so example code is hard to come by.

I was able to come up with this:

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for Collections.frequency(): "+ elapsedTime);

    return duplicates;
}

But this uses Collections.frequency, which loops through the whole list for each item and counts every occurrence, so the total work grows quadratically with the list size. This takes about 150ms to run on my current list of about 4,000 strings. This is a bit slow for me and will only get worse as the list size increases. I took the frequency method and rewrote it to return immediately on the 2nd occurrence:

protected boolean moreThanOne(Collection<?> c, Object o) {
    boolean found = false;
    if (o != null) {
        for (Object e : c) {
            if (o.equals(e)) {
                if (found) {
                    return found;
                } else {
                    found = true;
                }
            }
        }
    }
    return found;
}

and changed my method to use it:

public synchronized List<String> listMatching(List<String> allStrings)   {
    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for moreThanOne(): "+ elapsedTime);

    return duplicates;
}

This seems to work as expected, but does not really increase the speed as much as I was hoping, clocking in at about 120ms. This is probably due to it also needing to loop through the whole list for each item, but I am not sure how to avoid that and still accomplish the task.

I know this might seem like premature optimization, but my List can easily hold over a million entries, and this method is a critical piece of my app that influences the timing of everything else.

Do you see any way that I could further optimize this code? Perhaps using some sort of fancy Predicate? An entirely different approach?

EDIT: Thanks to all your suggestions, I was able to come up with something significantly faster:

public synchronized Set<String> listMatching(List<String> allStrings) {

    Set<String> allItems = new HashSet<>();
    Set<String> duplicates = allStrings.stream()
            .filter(string -> !allItems.add(string))
            .collect(Collectors.toSet());

    return duplicates;
}

Running under the same conditions, this is able to go through my list in <5ms. All the HashMap suggestions would have been great though, if I had needed to know the counts. Not sure why the Collections.frequency() method doesn't use that technique.

Jonathon Hoaglin asked Aug 23 '17 16:08



3 Answers

An easy way to find duplicates is to iterate over the list and call Set.add() to add each item to a temporary set. add() returns false if the item already exists in the set, which means the item is a duplicate.

public synchronized Set<String> listMatching(List<String> allStrings) {
   Set<String> tempSet = new HashSet<>();
   Set<String> duplicates = new HashSet<>();

   // Set.add() returns false when the item was already seen
   allStrings.forEach(item -> {
       if (!tempSet.add(item)) duplicates.add(item);
   });

   return duplicates;
}

dspano answered Sep 21 '22 05:09


A good way to make this really scalable is to build a Map that contains the count of each string. To build the map, you will look up each string in your list. If the string is not yet in the map, put the string and a count of one in the map. If the string is found in the map, increment the count.

You probably want to use some type that allows you to increment the count in-place, rather than having to put() the updated count each time. For example, you can use an int[] with one element.

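The other advantage of not re-putting counts is that it is easier to execute in parallel, because you can synchronize on the object that contains your count when you want to read/write the count. Note that a plain HashMap is not thread-safe, so inserting new keys from several threads would still need a concurrent map such as ConcurrentHashMap.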
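For the parallel case, here is a minimal sketch that swaps in a different technique from the one described above: instead of synchronizing on int[] holders by hand, it lets the standard Collectors.groupingByConcurrent collector do the thread-safe counting. The class and method names (ParallelCounts, countInParallel) are made up for illustration, and the sketch assumes the list contains no null elements.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
class ParallelCounts {

    // Counts occurrences of each string using a parallel stream.
    // groupingByConcurrent accumulates into a ConcurrentMap, so no manual
    // synchronization is needed. Assumes the list has no null elements.
    static Map<String, Long> countInParallel(List<String> strings) {
        return strings.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("luna", "mirana", "mirana", "mirana");
        Map<String, Long> counts = countInParallel(sample);
        System.out.println(counts.get("mirana")); // prints 3
    }
}
```

Whether this is actually faster than the sequential loop depends on the list size and hardware, so it is worth measuring before relying on it.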

The non-parallel code might look something like this:

Map<String, int[]> map = new HashMap<>(listOfStrings.size());
for (String s: listOfStrings) {
    int[] curCount = map.get(s);
    if (curCount == null) {
        curCount = new int[1];
        curCount[0] = 1;
        map.put(s, curCount);
    } else {
        curCount[0]++;
    }
}

Then you can iterate over the map entries and do the right thing based on the count of each string.
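For the question as asked, "the right thing" would presumably be to keep the strings whose count is greater than one. A minimal sketch of that last step, building the same kind of Map<String, int[]> and then filtering it, might look like this. The class and method names (CountMapDuplicates, findDuplicates) are invented for the example, and computeIfAbsent is used as a shorthand for the null check in the loop above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of the original answer.
class CountMapDuplicates {

    static Set<String> findDuplicates(List<String> listOfStrings) {
        // Pass 1: count each string in place using a one-element int[] holder.
        Map<String, int[]> map = new HashMap<>();
        for (String s : listOfStrings) {
            map.computeIfAbsent(s, key -> new int[1])[0]++;
        }

        // Pass 2: keep only the strings that were counted more than once.
        Set<String> duplicates = new HashSet<>();
        for (Map.Entry<String, int[]> entry : map.entrySet()) {
            if (entry.getValue()[0] > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("luna", "mirana", "mirana", "mirana");
        System.out.println(findDuplicates(sample)); // prints [mirana]
    }
}
```

Both passes are linear in the list size, and the counts remain available if you later need to know how often each string occurred.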

Rob answered Sep 21 '22 05:09


The best data structure for this is a Set<String>.

Add all elements from the list to the set.

Then traverse the list and remove each element from the set as you go.

If an element is not found in the set, it is a duplicate in the list (because an earlier occurrence already removed it).

This takes O(n) + O(n) time.

Code:

    List<String> list = new ArrayList<>();
    List<String> duplicates = new ArrayList<>();

    list.add("luna");
    list.add("mirana");
    list.add("mirana");
    list.add("mirana");

    Set<String> set = new HashSet<>();
    set.addAll(list);
    for(String a:list){
        if(set.contains(a)){
            set.remove(a);
        }else{
            duplicates.add(a);
        }
    }
    System.out.println(duplicates);

Output:

[mirana, mirana]

nagendra547 answered Sep 21 '22 05:09
