Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Identifying duplicate elements in a list containing 300k+ Strings

I have a list containing 305899 Strings (which is the username for a website). After I remove all the duplicates, the number goes down to 172123 Strings.

I want to find how many times a particular String (the username) is repeated in that ArrayList. I wrote a simple bubble sort type logic but it was too slow.

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
    int duplicate = 0;
    int size = userNameList.size();
    for (int i = 0; i < size - 1; i++) {
        duplicate = 0;
        for (int j = i + 1; j < size; j++) {
            if (userNameList.get(i).equals(userNameList.get(j))) {
                duplicate++;
                userNameList.remove(j);
                j--;
                size--;

            }
        }
        numberOfPosts.put(userNameList.get(i), duplicate);
    }

    return numberOfPosts;
}

Then I changed it to this:

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    Set<String> unique = new HashSet<String>(userNameList);

    for (String key : unique) {
        numberOfPosts.put(key, Collections.frequency(userNameList, key));
    }

    return numberOfPosts;
}

This was really slow as well. When I mean slow, it would take like 30+ minutes to through the list.

Is there any other efficient way to handle this problem? Just reduce the time it takes to find and count duplicate elements?

like image 942
javaCity Avatar asked Mar 20 '26 03:03

javaCity


1 Answers

Your findNumberOfPosts method is on the right track, but your implementation is doing loads of unnecessary work.
Try this:

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    for (String userName : userNameList) {
        Integer count = numberOfPosts.get(userName);
        numberOfPosts.put(userName, count == null ? 1 : ++count);
    }
    return numberOfPosts;
}

This should execute in a couple of seconds on most machines.

like image 146
Bohemian Avatar answered Mar 22 '26 10:03

Bohemian