 

Should we use HashSet?

A HashSet is backed by a HashMap. From its JavaDoc:

This class implements the Set interface, backed by a hash table (actually a HashMap instance)

When taking a look at the source we can also see how they relate to each other:

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();
public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}

Therefore a HashSet<E> is backed by a HashMap<E,Object>. All HashSets in our application share a single reference object PRESENT that is used as the value in the backing map. While the memory needed to store PRESENT itself is negligible, each entry in the map still stores a reference to it.

Would it not be more efficient to use null instead of PRESENT? A further consideration is whether we should forgo the HashSet altogether and use a HashMap directly, given that the circumstances permit the use of a Map instead of a Set.
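As a quick sketch of that idea (the class and method names here are mine, not from the JDK): a HashMap with null values supports the same containment check via containsKey. Note that each map entry contains a value reference field either way, so storing null instead of PRESENT does not shrink an entry; it only avoids the one shared PRESENT object.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SetVsMapDemo {
    // A HashSet delegates contains(e) to the backing map's containsKey(e).
    static boolean inSet(Set<String> set, String item) {
        return set.contains(item);
    }

    // Using a HashMap directly with null values: containsKey still works,
    // because containment is decided by the key, not the value.
    static boolean inMap(Map<String, Object> map, String item) {
        return map.containsKey(item);
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("a");

        Map<String, Object> map = new HashMap<>();
        map.put("a", null); // null value instead of a shared PRESENT object

        System.out.println(inSet(set, "a")); // true
        System.out.println(inMap(map, "a")); // true
    }
}
```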

My basic problem that triggered these thoughts is the following situation: I have a collection of objects with the following properties:

  • large collection of objects (> 30'000)
  • Insertion order is not relevant
  • Efficient check whether an item is contained
  • Adding new items to the collection is not relevant

The chosen solution should perform optimally against the above criteria while minimizing memory consumption. On this basis the data structures HashSet and HashMap spring to mind. When thinking about alternative approaches, the key question is:

How to check containment efficiently?

The only answer that comes to my mind is using the item's hash to calculate the storage location. I might be missing something here. Are there any other approaches?
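For illustration, "use the hash to calculate the storage location" is essentially what HashMap does internally. A minimal sketch (my own simplification, not the JDK source) of mapping a hash code to a bucket index in a power-of-two-sized table:

```java
public class BucketIndexDemo {
    // Mix high bits into low bits, similar in spirit to what
    // HashMap does before indexing, so that hash codes differing
    // only in their upper bits do not collide in small tables.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // For a power-of-two table size, the bucket index is a cheap bit mask
    // instead of a modulo operation.
    static int bucketIndex(Object key, int tableSize) {
        return spread(key.hashCode()) & (tableSize - 1);
    }

    public static void main(String[] args) {
        int idx = bucketIndex("example", 16);
        System.out.println(idx >= 0 && idx < 16); // always within the table
    }
}
```

Lookup is then a constant-time jump to a bucket plus a scan of the (ideally very short) chain of entries in it.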

I had a look at various questions that did shed some light on the issue, but did not quite answer my question:

  • Java : HashSet vs. HashMap
  • clarifying facts behind Java's implementation of HashSet/HashMap
  • Java HashSet vs HashMap

I am not looking for suggestions of alternative libraries or frameworks to address this; I want to understand whether there is another way to think about efficient containment checking of an element in a Collection.

asked Jan 28 '17 by hotzst




2 Answers

In short, yes, you should use HashSet. It might not be the most efficient Set implementation possible, but that hardly ever matters, unless you are working with huge amounts of data.

In that case, I would suggest using specialized libraries: EnumMap/EnumSet if you can use enums, primitive collections such as Trove if your data is mostly primitives, other data structures that are optimized for certain data types, or even an in-memory database.

Don't get me wrong, I'm someone who likes performance tuning, too, but replacing the built-in data structures should only be done when it's really necessary. For most cases, they work perfectly fine.

What you could do, in case you really want to save every last bit of memory and do not care about insertion, is use a fixed-size array, sort it, and do a binary search every time. But I doubt that it's more efficient than a HashSet.
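The sorted-array idea can be sketched like this (a minimal example of mine, assuming the elements can be represented as primitives): containment becomes O(log n) with no per-entry node objects, at the cost of O(n) insertion.

```java
import java.util.Arrays;

public class SortedArrayContains {
    // Arrays.binarySearch returns a non-negative index on a hit
    // and a negative insertion point on a miss.
    static boolean contains(long[] sorted, long item) {
        return Arrays.binarySearch(sorted, item) >= 0;
    }

    public static void main(String[] args) {
        long[] data = {42, 7, 19, 3};
        Arrays.sort(data); // one-time cost; no cheap inserts afterwards

        System.out.println(contains(data, 19)); // true
        System.out.println(contains(data, 5));  // false
    }
}
```

For 30'000 primitives this stores only the values themselves, with no boxing and no entry objects, which is about as compact as it gets on the JVM.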

answered Sep 29 '22 by Silverclaw


Hashtables and HashSets are used for entirely different purposes, so maybe the two shouldn't be compared as to which is more efficient. A HashSet is more suitable for the mathematical notion of a set (e.g. {1,2,3,4}): it contains no duplicates and allows at most one null element. A HashMap, on the other hand, is a key -> value system: it allows multiple null values as well as duplicate values, just not duplicate keys. I know this is probably answering "what is the difference between a hashtable and a hashset", but I think my point is that they really can't be compared.
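The null-handling differences mentioned above can be demonstrated directly (a small sketch of mine):

```java
import java.util.HashMap;
import java.util.HashSet;

public class NullHandlingDemo {
    public static void main(String[] args) {
        HashSet<String> set = new HashSet<>();
        set.add(null);
        set.add(null);                  // duplicate null is ignored
        System.out.println(set.size()); // 1

        HashMap<String, String> map = new HashMap<>();
        map.put(null, "v");             // at most one null key
        map.put("a", null);             // but null values under many keys
        map.put("b", null);
        System.out.println(map.size()); // 3
    }
}
```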

answered Sep 29 '22 by Ash_s94