Should I use a `HashSet` or a `TreeSet` for a very large dataset?

Tags:

I have a requirement to store 2 to 15 million Accounts (which are a String of length 15) in a data structure for lookup purpose and checking uniqueness. Initially I planned to store them in a HashSet, but I doubt the speed of the lookup will be slow because of hash collisions and will eventually be slower than a TreeMap (using Binary search).

There is no requirement for Data to be sorted. I am using Java 7. I have 64G system with 48G dedicated for this application.

This question is not a duplicate of HashSet and TreeSet performance test because that question is about the performance of adding elements to a Set and this question is about the performance of checking an existing Set for duplicate values.

837

asked Aug 04 '15 04:08

Mohan

2 Answers

If you have 48 GB of dedicated Memory for your 2 million to 15 million records, your best bet is probably to use a HashMap<Key, Record>, where your key is an Integer or a String depending on your requirements.

You will be fine as far as hash collisions go as long as you give enough memory to the Map and have an appropriate load factor.

I recommend using the following constructor: new HashMap<>(13_000_000); (30% more than your expected number of records - which will be automatically expanded by HashMap's implementation to 2^24 cells). Tell your application that this Map will be very large from the get-go so it doesn't need to automatically grow as you populate it.

HashMap uses an O(1) access time for it's members, whereas TreeMap uses O(log n) lookup time, but can be more efficient with memory and doesn't need a clever hashing function. However, if you're using String or Integer keys, you don't need to worry about designing a hashing function and the constant time lookups will be a huge improvement. Also, another advantage of TreeMap / TreeSet is the sorted ordering, which you stated you don't care about; use HashMap.

If the only purpose of the list is to check for unique account numbers, then everything I've said above is still true, but as you stated in your question, you should use a HashSet<String>, not a HashMap. The performance recommendations and constructor argument is still applicable.

Further reading: HashSet and TreeSet performance test

137

answered Nov 08 '22 10:11

durron597

When we tried to store 50 million records in HashMap with proper initialization parameters, insertion started to slowdown, especially after 35 million records. Changing to TreeMap gave a constant insertion and retrieval performance.

Observation : TreeMap will give better performance than a HashMap for large input set. For a smaller set, of course HashMap will give better performance.

answered Nov 08 '22 09:11

Mohan

Related questions
                            
                                What is a JmsTemplate callback in Spring JMS?
                            
                                How to edit .jar file in Android Studio
                            
                                Print 1 to 100 using 10 threads in java
                            
                                Parsing a Json array using Play Framework and Java
                            
                                Reading from a Text file in Android Studio Java
                            
                                Different Netty versions and their purposes
                            
                                Why does my compare methd throw IllegalArgumentException sometimes?
                            
                                Unsupported add/addAll operations for Map<K,V>.keySet()
                            
                                Hibernate with Sql Server fail for nvarchar field with "No Dialect mapping..."
                            
                                Java Google Analytics API - Read timed out
                            
                                JPA difference between transaction isolation and entity locking
                            
                                API to integrate with Domino
                            
                                Java 8 lambdas grouping by multiple fields
                            
                                Android button background color changes button size
                            
                                Eclipse Mars: Content Assistant doesn't complete when pressing dot
                            
                                Download and write a file with Retrofit and RxJava
                            
                                How to use Space in column name with JPA and Hibernate
                            
                                Java Spring MVC template Engine
                            
                                HashSet contains() method
                            
                                Java methods and generics

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Should I use a `HashSet` or a `TreeSet` for a very large dataset?

Tags:

java

performance

hashset

treeset

Mohan

People also ask

2 Answers

durron597

Mohan

Recent Activity

Donate For Us