
Java: optimize hashset for large-scale duplicate detection

I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"

I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with

tweetids = new HashSet<String>(220000,0.80F);

and that lets it get a little farther, but it is still excruciatingly slow (by around 10 million items it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead of such a set incur too great a time penalty? Would things work better if I weren't using a String, or if I defined a different hashCode function (which, for a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.

tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))){
    duplicates++;
    continue; 
}
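For reference, the presizing the question asks about amounts to computing the initial capacity from the expected element count and the load factor, since a HashSet only rehashes once its size exceeds capacity * loadFactor. A minimal sketch, assuming the ~22 million estimate above (this is only illustrative; the final solution below ended up using Trove instead):

import java.util.HashSet;
import java.util.Set;

public class PresizeSketch {
    public static void main(String[] args) {
        int expectedTweets = 22_000_000;    // rough upper bound for this run
        float loadFactor = 0.80F;
        // HashSet rehashes once size exceeds capacity * loadFactor, so a
        // capacity of expected / loadFactor avoids rehashing altogether.
        int initialCapacity = (int) Math.ceil(expectedTweets / loadFactor);
        Set<String> tweetids = new HashSet<>(initialCapacity, loadFactor);
        System.out.println("initial capacity: " + initialCapacity);
    }
}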

SOLUTION

Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for, because String.hashCode() is exorbitant at this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. The final implementation was simple, and looked like this:

import gnu.trove.set.hash.TLongHashSet;
...
    TLongHashSet tweetids; // class variable
... 
    tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
    // inside for(each record)
    String twid = (String) tweet_twitter_data.get("id");
    if (!(tweetids.add(Long.parseLong(twid)))) {
        duplicates++;
        continue; 
    }
asked May 22 '13 by WorldsEndless

People also ask

How HashSet detect duplicates?

By using HashSet, a general-purpose Set implementation, we can find duplicates in O(n) time. All you need to do is iterate over the array with an enhanced for loop and insert every element into the HashSet. Since it allows only unique elements, the add() method will fail and return false when you try to add a duplicate.
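A minimal sketch of that pattern (the IDs here are made up):

import java.util.HashSet;
import java.util.Set;

public class FindDuplicates {
    public static void main(String[] args) {
        String[] ids = {"166471306949304320", "166471306949304321", "166471306949304320"};
        Set<String> seen = new HashSet<>();
        for (String id : ids) {
            // add() returns false when the element is already present
            if (!seen.add(id)) {
                System.out.println("duplicate: " + id);
            }
        }
    }
}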

Is HashSet or ArrayList faster?

In conclusion, the contains() method works faster on a HashSet than on an ArrayList.

How do I allow duplicates in HashSet?

Duplicates: HashSet doesn't allow duplicate values. HashMap stores key-value pairs and does not allow duplicate keys; if a key is inserted again, its old value is replaced with the new value.

What is the time complexity of HashSet?

For HashSet, LinkedHashSet, and EnumSet, the add(), remove(), and contains() operations cost constant O(1) time thanks to the internal HashMap implementation. TreeSet, by contrast, has O(log n) time complexity for the same operations.


2 Answers

You may want to look beyond the Java collections framework. I've done some memory-intensive processing, and you will face several problems:

  1. The number of buckets in large hash maps and hash sets causes a lot of memory overhead. You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
  2. Strings are represented using 16-bit characters in Java. You can halve that by using UTF-8-encoded byte arrays for most scripts.
  3. HashMaps are in general quite wasteful data structures, and HashSets are basically just a thin wrapper around them.

Given that, take a look at Trove or Guava for alternatives. Also, your IDs look like longs. Those are 64 bits, quite a bit smaller than the string representation.

An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) whether something is contained. That, combined with some disk-based solution (e.g. a database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new IDs, write them in batches, and use the Bloom filter to check whether you need to look in the database, and thus avoid expensive lookups most of the time.
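To make that concrete, here is a rough sketch using Guava's BloomFilter. The expected-insertion count and false-positive rate are illustrative, and isInDatabase/persist are hypothetical stand-ins for whatever disk-backed store and batching you use:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomDedupSketch {
    // Sized for ~25 million ids with a ~1% false-positive rate (illustrative numbers).
    private final BloomFilter<Long> seen =
            BloomFilter.create(Funnels.longFunnel(), 25_000_000, 0.01);

    public boolean isDuplicate(long id) {
        // mightContain() can return false positives but never false negatives,
        // so only a hit here requires consulting the authoritative store.
        if (seen.mightContain(id) && isInDatabase(id)) {
            return true;
        }
        seen.put(id);   // remember the new id in the filter
        persist(id);    // in practice, buffer and write to the store in batches
        return false;
    }

    private boolean isInDatabase(long id) {
        return false;   // hypothetical lookup against the disk-based store
    }

    private void persist(long id) {
        // hypothetical batched write to the disk-based store
    }
}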

answered Sep 20 '22 by Jilles van Gurp

If you are just looking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than that of a HashSet, and it's quicker for string lookups.

The main disadvantage is that it can be slower when used from a hard disk, as you are loading a tree rather than a linearly stored structure like a hash table. So make sure that it can be held in RAM.

The link I gave is a good list of pros/cons of this approach.
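For concreteness, a minimal sketch of such a trie specialized to digit-only IDs like the ones in the question (each node has ten children, one per digit). This is only illustrative, and note that a naive node-per-digit trie like this is itself memory-hungry at tens of millions of entries:

public class DigitTrie {
    private static final class Node {
        final Node[] children = new Node[10]; // one slot per digit 0-9
        boolean terminal;                     // true if an id ends at this node
    }

    private final Node root = new Node();

    // Inserts the id and returns true if it was not present before.
    public boolean add(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            int digit = id.charAt(i) - '0';
            if (node.children[digit] == null) {
                node.children[digit] = new Node();
            }
            node = node.children[digit];
        }
        boolean isNew = !node.terminal;
        node.terminal = true;
        return isNew;
    }

    public static void main(String[] args) {
        DigitTrie trie = new DigitTrie();
        System.out.println(trie.add("166471306949304320")); // true, first time
        System.out.println(trie.add("166471306949304320")); // false, duplicate
    }
}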

*As an aside, the Bloom filters suggested by Jilles van Gurp make great, fast prefilters.

answered Sep 18 '22 by greedybuddha