 

Finding duplicate numbers in an array of numbers

I was asked this during an interview: given a list of numbers, return only the duplicates present in the input, as sorted output.

Example:

Input = [6, 7, 5, 6, 1, 0, 1, 0, 5, 3, 2]
Output = [0, 1, 5, 6] - the sorted, unique numbers that occur more than once in the input

I have come up with the solutions below:

Approach1:

public static List<Integer> process(List<Integer> input) {
    List<Integer> result = new ArrayList<>();
    Map<Integer, Integer> map = new HashMap<>();
    for (int val : input) {
        map.put(val, map.getOrDefault(val, 0) + 1);
    }

    map.forEach((key, val) -> {
        if (val > 1) {
            result.add(key);
        }
    });
    result.sort(null);
    return result;
}

Updated Approach2:

public static List<Integer> process1(List<Integer> input) {
    Set<Integer> dups = new HashSet<>();
    Set<Integer> set = new HashSet<>();
    for (int val : input) {
        if (set.contains(val)) {
            dups.add(val);
        } else {
            set.add(val);
        }
    }
    List<Integer> result = new ArrayList<>(dups);
    result.sort(null);
    return result;
}
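As a side note, the contains-then-add pair in Updated Approach2 can be collapsed by using the boolean that Set.add returns (false when the value was already present). A minimal sketch of that variant (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DupsViaAdd {
    // Same logic as Updated Approach2, but relying on Set.add's
    // return value instead of a separate contains() lookup.
    public static List<Integer> process(List<Integer> input) {
        Set<Integer> seen = new HashSet<>();
        Set<Integer> dups = new HashSet<>();
        for (int val : input) {
            if (!seen.add(val)) {   // already seen -> it's a duplicate
                dups.add(val);
            }
        }
        List<Integer> result = new ArrayList<>(dups);
        result.sort(null);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(process(List.of(6, 7, 5, 6, 1, 0, 1, 0, 5, 3, 2)));
        // prints [0, 1, 5, 6]
    }
}
```

This avoids one hash lookup per element, though the asymptotic complexity is unchanged.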

Old Approach2

public static List<Integer> process1(List<Integer> input) {
    List<Integer> result = new ArrayList<>();
    Set<Integer> set = new HashSet<>();
    for (int val : input) {
        if (set.contains(val)) {
            result.add(val);
        } else {
            set.add(val);
        }
    }
    result.sort(null);
    return result;
}

Approach1's time complexity is O(n log n), since sorting in Java is O(n log n); its space complexity is O(n).

Approach2's time complexity is again O(n log n) because of the sort; its space usage is a little lower than Approach1's, since I hold each element only once in my sets.

Please correct me if my time and space complexities are wrong.

Now the question is whether this logic works if the input contains millions of numbers. Does a HashMap still work when the input is millions of numbers?

As I understand it, map and set operations are generally fast, and HashSet is internally implemented on top of HashMap. How should I answer this question?

asked Oct 11 '19 by learner




2 Answers

Approach2 (the old version) fails if a number is present three or more times, as it adds that number to the output multiple times. You're right about the lower space usage, but your reasoning is a little off: it's lower because HashSet internally stores the same shared dummy object as the value in its underlying HashMap to indicate presence, whereas in Approach1 you allocate a new Integer count on every update.
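The failure is easy to reproduce with a value that occurs three times (method name is mine; the body is the old Approach2 verbatim):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TripleDup {
    // The old Approach2: appends to the result list
    // every time a value is seen again, not just the first time.
    public static List<Integer> oldProcess(List<Integer> input) {
        List<Integer> result = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int val : input) {
            if (set.contains(val)) {
                result.add(val);
            } else {
                set.add(val);
            }
        }
        result.sort(null);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(oldProcess(List.of(7, 7, 7)));
        // prints [7, 7] -- 7 is reported twice instead of once
    }
}
```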

A HashMap internally holds an array of buckets, so generally, if you're able to allocate a list containing a million numbers, you should also be able to allocate a HashMap holding (at most) as many entries.

It would be a good idea to set the initial capacity of the HashMap when constructing it to the size of the list. This will make your code faster for large lists since it avoids rehashing.
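For example, the counting map from Approach1 could be pre-sized like this (the sizing formula accounts for HashMap's default load factor of 0.75; class and method names are mine):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PresizedMap {
    public static Map<Integer, Integer> countOccurrences(List<Integer> input) {
        // Pre-size so the map never rehashes: HashMap resizes once its
        // size exceeds capacity * loadFactor (0.75 by default), so a bare
        // new HashMap<>(input.size()) could still trigger one resize.
        Map<Integer, Integer> counts = new HashMap<>((int) (input.size() / 0.75f) + 1);
        for (int val : input) {
            counts.merge(val, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts =
                countOccurrences(List.of(6, 7, 5, 6, 1, 0, 1, 0, 5, 3, 2));
        System.out.println(counts.get(6));  // prints 2
        System.out.println(counts.get(3));  // prints 1
    }
}
```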

Note that there may be a faster approach: sorting the initial list. In a sorted list, finding duplicates is trivial since they are adjacent, so you don't need a HashMap at all. However, if you're not allowed to modify the input, you need to copy it first, so the space requirements stay the same. The theoretical complexity also stays the same (sorting is O(n log n), finding the duplicates is O(n)). The actual time spent sorting will be higher, since you now sort the whole large list, but you avoid all the allocations inside the HashMap; this may or may not make up for the additional sorting time.

answered Oct 18 '22 by flyx


I was curious how different implementations of this algorithm would behave under JMH performance tests, and the fastest implementation I came up with is:

public static List<Integer> process(List<Integer> input) {
    Set<Integer> all = new HashSet<>(input.size());
    Set<Integer> output = new TreeSet<>();  // keeps duplicates sorted as we go

    for (Integer val : input) {
        if (!all.add(val)) {  // add() returns false if val was already present
            output.add(val);
        }
    }

    return new ArrayList<>(output);
}

Below are JMH results for the above implementation (algo2) and your Approach 1 implementation (algo1):

Benchmark                   (N)  Mode  Cnt    Score    Error  Units
PerformanceTests.algo1  1000000  avgt    3  323.265 ± 33.919  ms/op
PerformanceTests.algo2  1000000  avgt    3  285.505 ± 29.744  ms/op

Update: @josejuan, you were right; below is an algorithm 6x faster than the previous ones:

public static List<Integer> process(List<Integer> INPUT) {
    if (INPUT.isEmpty()) {
        return new ArrayList<>();
    }
    // copy into a primitive array so the sort works on unboxed ints
    int[] input = new int[INPUT.size()];
    for (int i = 0; i < input.length; i++) {
        input[i] = INPUT.get(i);
    }
    Arrays.sort(input);

    // scan the sorted array: duplicates are adjacent,
    // and the 'added' flag reports each duplicate only once
    List<Integer> output = new ArrayList<>(input.length);
    int prev = input[0];
    boolean added = false;
    for (int i = 1; i < input.length; i++) {
        if (prev == input[i]) {
            if (!added) {
                output.add(prev);
                added = true;
            }
        } else {
            added = false;
            prev = input[i];
        }
    }
    return output;
}
answered Oct 18 '22 by Adam Siemion