I have a requirement to store 2 to 15 million Accounts (which are a String
of length 15) in a data structure for lookup purpose and checking uniqueness. Initially I planned to store them in a HashSet
, but I doubt the speed of the lookup will be slow because of hash collisions and will eventually be slower than a TreeMap (using Binary search).
There is no requirement for Data to be sorted. I am using Java 7. I have 64G system with 48G dedicated for this application.
This question is not a duplicate of HashSet and TreeSet performance test because that question is about the performance of adding elements to a Set
and this question is about the performance of checking an existing Set
for duplicate values.
Simply put, HashSet is faster than the TreeSet. HashSet provides constant-time performance for most operations like add(), remove() and contains(), versus the log(n) time offered by the TreeSet.
A HashSet is usually used for high-performance operations involving a set of unique data. Since HashSet contains only unique elements, its internal structure is optimized for faster searches. Note that you can store a single null value in a HashSet.
In short, in the underlying implementation, it actually stores an extra dummy object (12 bytes) with each entry of hashset. And a pointer (8 bytes) to next entry. Thus conceding extra 12+8 bytes per entry. So total memory per entry: 16+12+8 = 36 bytes.
HashMap is faster than HashSet because the values are associated to a unique key. In HashSet , member object is used for calculating hashcode value which can be same for two objects so equals() method is used to check for equality.
If you have 48 GB of dedicated Memory for your 2 million to 15 million records, your best bet is probably to use a HashMap<Key, Record>
, where your key is an Integer
or a String
depending on your requirements.
You will be fine as far as hash collisions go as long as you give enough memory to the Map
and have an appropriate load factor.
I recommend using the following constructor: new HashMap<>(13_000_000);
(30% more than your expected number of records - which will be automatically expanded by HashMap
's implementation to 2^24
cells).
Tell your application that this Map
will be very large from the get-go so it doesn't need to automatically grow as you populate it.
HashMap
uses an O(1)
access time for it's members, whereas TreeMap
uses O(log n)
lookup time, but can be more efficient with memory and doesn't need a clever hashing function. However, if you're using String
or Integer
keys, you don't need to worry about designing a hashing function and the constant time lookups will be a huge improvement. Also, another advantage of TreeMap
/ TreeSet
is the sorted ordering, which you stated you don't care about; use HashMap
.
If the only purpose of the list is to check for unique account numbers, then everything I've said above is still true, but as you stated in your question, you should use a HashSet<String>
, not a HashMap
. The performance recommendations and constructor argument is still applicable.
Further reading: HashSet and TreeSet performance test
When we tried to store 50 million records in HashMap with proper initialization parameters, insertion started to slowdown, especially after 35 million records. Changing to TreeMap gave a constant insertion and retrieval performance.
Observation : TreeMap will give better performance than a HashMap for large input set. For a smaller set, of course HashMap will give better performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With