
Java: optimize hashset for large-scale duplicate detection

I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"

I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with

tweetids = new HashSet<String>(220000,0.80F);

and that lets it get a little farther, but it is still excruciatingly slow (by around 10 million items it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead of such a set incur too great a time penalty? Would things work better if I weren't using a String, or if I defined a different hashCode function (which, for a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.

tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))){
    duplicates++;
    continue; 
}
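For reference, the presizing the question asks about amounts to computing the initial capacity from the expected element count and the load factor, since a HashSet only rehashes once its size exceeds capacity * loadFactor. A minimal sketch, assuming the ~22 million estimate above (this is only illustrative; the final solution below ended up using Trove instead):

import java.util.HashSet;
import java.util.Set;

public class PresizeSketch {
    public static void main(String[] args) {
        int expectedTweets = 22_000_000;    // rough upper bound for this run
        float loadFactor = 0.80F;
        // HashSet rehashes once size exceeds capacity * loadFactor, so a
        // capacity of expected / loadFactor avoids rehashing altogether.
        int initialCapacity = (int) Math.ceil(expectedTweets / loadFactor);
        Set<String> tweetids = new HashSet<>(initialCapacity, loadFactor);
        System.out.println("initial capacity: " + initialCapacity);
    }
}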

SOLUTION

Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for, because String.hashCode() is exorbitant at this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. The final implementation was simple, and looked like this:

import gnu.trove.set.hash.TLongHashSet;
...
    TLongHashSet tweetids; // class variable
... 
    tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
    // inside for(each record)
    String twid = (String) tweet_twitter_data.get("id");
    if (!(tweetids.add(Long.parseLong(twid)))) {
        duplicates++;
        continue; 
    }
asked May 22 '13 by WorldsEndless

People also ask

How HashSet detect duplicates?

By using HashSet, a general-purpose Set implementation, we can find duplicates in O(n) time. All you need to do is iterate over the array with an enhanced for loop and insert every element into the HashSet. Since it allows only unique elements, the add() method will fail and return false when you try to add a duplicate.
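A minimal sketch of that pattern (the IDs here are made up):

import java.util.HashSet;
import java.util.Set;

public class FindDuplicates {
    public static void main(String[] args) {
        String[] ids = {"166471306949304320", "166471306949304321", "166471306949304320"};
        Set<String> seen = new HashSet<>();
        for (String id : ids) {
            // add() returns false when the element is already present
            if (!seen.add(id)) {
                System.out.println("duplicate: " + id);
            }
        }
    }
}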

Is HashSet or ArrayList faster?

In conclusion, the contains() method works faster on a HashSet than on an ArrayList.

How do I allow duplicates in HashSet?

Duplicates: HashSet doesn't allow duplicate values. HashMap stores key-value pairs and does not allow duplicate keys; if a key is inserted again, its old value is replaced with the new value.

What is the time complexity of HashSet?

For HashSet, LinkedHashSet, and EnumSet, the add(), remove(), and contains() operations cost constant O(1) time thanks to the internal HashMap implementation. TreeSet, by contrast, has O(log n) time complexity for the same operations.


2 Answers

You may want to look beyond the Java collections framework. I've done some memory-intensive processing, and you will face several problems:

  1. The number of buckets in large hash maps and hash sets causes a lot of memory overhead. You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
  2. Strings are represented using 16-bit characters in Java. You can halve that by using UTF-8-encoded byte arrays for most scripts.
  3. HashMaps are in general quite wasteful data structures, and HashSets are basically just a thin wrapper around them.

Given that, take a look at Trove or Guava for alternatives. Also, your IDs look like longs. Those are 64 bits, quite a bit smaller than the string representation.

An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) whether something is contained. That, combined with some disk-based solution (e.g. a database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new IDs, write them in batches, and use the Bloom filter to check whether you need to look in the database, and thus avoid expensive lookups most of the time.
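To make that concrete, here is a rough sketch using Guava's BloomFilter. The expected-insertion count and false-positive rate are illustrative, and isInDatabase/persist are hypothetical stand-ins for whatever disk-backed store and batching you use:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomDedupSketch {
    // Sized for ~25 million ids with a ~1% false-positive rate (illustrative numbers).
    private final BloomFilter<Long> seen =
            BloomFilter.create(Funnels.longFunnel(), 25_000_000, 0.01);

    public boolean isDuplicate(long id) {
        // mightContain() can return false positives but never false negatives,
        // so only a hit here requires consulting the authoritative store.
        if (seen.mightContain(id) && isInDatabase(id)) {
            return true;
        }
        seen.put(id);   // remember the new id in the filter
        persist(id);    // in practice, buffer and write to the store in batches
        return false;
    }

    private boolean isInDatabase(long id) {
        return false;   // hypothetical lookup against the disk-based store
    }

    private void persist(long id) {
        // hypothetical batched write to the disk-based store
    }
}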

answered Sep 20 '22 by Jilles van Gurp

If you are just looking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than that of a HashSet, and it's quicker for string lookups.

The main disadvantage is that it can be slower when used from a hard disk, as you are loading a tree rather than a linearly stored structure like a hash table. So make sure that it can be held in RAM.

The link I gave is a good list of pros/cons of this approach.
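For concreteness, a minimal sketch of such a trie specialized to digit-only IDs like the ones in the question (each node has ten children, one per digit). This is only illustrative, and note that a naive node-per-digit trie like this is itself memory-hungry at tens of millions of entries:

public class DigitTrie {
    private static final class Node {
        final Node[] children = new Node[10]; // one slot per digit 0-9
        boolean terminal;                     // true if an id ends at this node
    }

    private final Node root = new Node();

    // Inserts the id and returns true if it was not present before.
    public boolean add(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            int digit = id.charAt(i) - '0';
            if (node.children[digit] == null) {
                node.children[digit] = new Node();
            }
            node = node.children[digit];
        }
        boolean isNew = !node.terminal;
        node.terminal = true;
        return isNew;
    }

    public static void main(String[] args) {
        DigitTrie trie = new DigitTrie();
        System.out.println(trie.add("166471306949304320")); // true, first time
        System.out.println(trie.add("166471306949304320")); // false, duplicate
    }
}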

*As an aside, the Bloom filters suggested by Jilles van Gurp make great, fast prefilters.

answered Sep 18 '22 by greedybuddha