Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should a HashSet be allowed to be added to itself in Java?

According to the contract for a Set in Java, "it is not permissible for a set to contain itself as an element" (source). However, this is possible in the case of a HashSet of Objects, as demonstrated here:

Set<Object> mySet = new HashSet<>();
mySet.add(mySet);
assertThat(mySet.size(), equalTo(1));

This assertion passes, but I would expect the behavior to be to either have the resulting set be 0 or to throw an Exception. I realize the underlying implementation of a HashSet is a HashMap, but it seems like there should be an equality check before adding an element to avoid violating that contract, no?

like image 901
davidmerrick Avatar asked Apr 19 '18 15:04

davidmerrick


People also ask

What does HashSet add do in Java?

HashSet add() Method in Java util. HashSet. add() method in Java HashSet is used to add a specific element into a HashSet. This method will add the element only if the specified element is not present in the HashSet else the function will return False if the element is already present in the HashSet.

When should HashSet be used?

A HashSet is usually used for high-performance operations involving a set of unique data. Since HashSet contains only unique elements, its internal structure is optimized for faster searches. Note that you can store a single null value in a HashSet.

When we should use Set or HashSet in Java?

Java HashSet class is used to create a collection that uses a hash table for storage. It inherits the AbstractSet class and implements Set interface. The important points about Java HashSet class are: HashSet stores the elements by using a mechanism called hashing.

Why should you use HashSet Why does it store unique values?

It has unique elements are does not guarantee order or sorting. HashSet uses buckets to store data and hence most of the operations are constant in time. You can use HashSets when you want to do de-duplication of elements or store elements in away where you don't want to retrieve a specific element in specific order.


4 Answers

Others have already pointed out why it is questionable from a mathematical point of view, by referring to Russell's paradox.

This does not answer your question on a technical level, though.

So let's dissect this:

First, once more the relevant part from the JavaDoc of the Set interface:

Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.

Interestingly, the JavaDoc of the List interface makes a similar, although somewhat weaker, and at the same time more technical statement:

While it is permissible for lists to contain themselves as elements, extreme caution is advised: the equals and hashCode methods are no longer well defined on such a list.

And finally, the crux is in the JavaDoc of the Collection interface, which is the common ancestor of both the Set and the List interface:

Some collection operations which perform recursive traversal of the collection may fail with an exception for self-referential instances where the collection directly or indirectly contains itself. This includes the clone(), equals(), hashCode() and toString() methods. Implementations may optionally handle the self-referential scenario, however most current implementations do not do so.

(Emphasis by me)

The bold part is a hint at why the approach that you proposed in your question would not be sufficient:

it seems like there should be an equality check before adding an element to avoid violating that contract, no?

This would not help you here. The key point is that you'll always run into problems when the collection will directly or indirectly contain itself. Imagine this scenario:

Set<Object> setA = new HashSet<Object>();
Set<Object> setB = new HashSet<Object>();
setA.add(setB);
setB.add(setA);

Obviously, neither of the sets contains itself directly. But each of them contains the other - and therefore, itself indirectly. This could not be avoided by a simple referential equality check (using == in the add method).


Avoiding such an "inconsistent state" is basically impossible in practice. Of course it is possible in theory, using referential Reachability computations. In fact, the Garbage Collector basically has to do exactly that!

But it becomes impossible in practice when custom classes are involved. Imagine a class like this:

class Container {

    Set<Object> set;

    @Override 
    int hashCode() {
        return set.hashCode(); 
    }
}

And messing around with this and its set:

Set<Object> set = new HashSet<Object>();
Container container = new Container();
container.set = set;
set.add(container);

The add method of the Set basically has no way of detecting whether the object that is added there has some (indirect) reference to the set itself.

Long story short:

You cannot prevent the programmer from messing things up.

like image 155
Marco13 Avatar answered Oct 21 '22 15:10

Marco13


Adding the collection into itself once causes the test to pass. Adding it twice causes the StackOverflowError which you were seeking.

From a personal developer standpoint, it doesn't make any sense to enforce a check in the underlying code to prevent this. The fact that you get a StackOverflowError in your code if you attempt to do this too many times, or calculate the hashCode - which would cause an instant overflow - should be enough to ensure that no sane developer would keep this kind of code in their code base.

like image 21
Makoto Avatar answered Oct 21 '22 15:10

Makoto


You need to read the full doc and quote it fully:

The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.

The actual restriction is in the first sentence. The behavior is unspecified if an element of a set is mutated.

Since adding a set to itself mutates it, and adding it again mutates it again, the result is unspecified.

Note that the restriction is that the behavior is unspecified, and that a special case of that restriction is adding the set to itself.

So the doc says, in other words, that adding a set to itself results in unspecified behavior, which is what you are seeing. It's up to the concrete implementation to deal with (or not).

like image 23
StandByUkraine Avatar answered Oct 21 '22 14:10

StandByUkraine


I agree with you that, from a mathematical perspective, this behavior really doesn't make sense.

There are two interesting questions here: first, to what extent were the designers of the Set interface trying to implement a mathematical set? Secondly, even if they weren't, to what extent does that exempt them from the rules of set theory?

For the first question, I will point you to the documentation of the Set:

A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

It's worth mentioning here that current formulations of set theory don't permit sets to be members of themselves. (See the Axiom of regularity). This is due in part to Russell's Paradox, which exposed a contradiction in naive set theory (which permitted a set to be any collection of objects - there was no prohibition against sets including themselves). This is often illustrated by the Barber Paradox: suppose that, in a particular town, a barber shaves all of the men - and only the men - who do not shave themselves. Question: does the barber shave himself? If he does, it violates the second constraint; if he doesn't, it violates the first constraint. This is clearly logically impossible, but it's actually perfectly permissible under the rules of naive set theory (which is why the newer "standard" formulation of set theory explicitly bans sets from containing themselves).

There's more discussion in this question on Math.SE about why sets cannot be an element of themselves.

With that said, this brings up the second question: even if the designers hadn't been explicitly trying to model a mathematical set, would this be completely "exempt" from the problems associated with naive set theory? I think not - I think that many of the problems that plagued naive set theory would plague any kind of a collection that was insufficiently constrained in ways that were analogous to naive set theory. Indeed, I may be reading too much into this, but the first part of the definition of a Set in the documentation sounds suspiciously like the intuitive concept of a set in naive set theory:

A collection that contains no duplicate elements.

Admittedly (and to their credit), they do place at least some constraints on this later (including stating that you really shouldn't try to have a Set contain itself), but you could question whether it's really "enough" to avoid the problems with naive set theory. This is why, for example, you have a "turtles all the way down" problem when trying to calculate the hash code of a HashSet that contains itself. This is not, as some others have suggested, merely a practical problem - it's an illustration of the fundamental theoretical problems with this type of formulation.

As a brief digression, I do recognize that there are, of course, some limitations on how closely any collection class can really model a mathematical set. For example, Java's documentation warns against the dangers of including mutable objects in a set. Some other languages, such as Python, at least attempt to ban many kinds of mutable objects entirely:

The set classes are implemented using dictionaries. Accordingly, the requirements for set elements are the same as those for dictionary keys; namely, that the element defines both __eq__() and __hash__(). As a result, sets cannot contain mutable elements such as lists or dictionaries. However, they can contain immutable collections such as tuples or instances of ImmutableSet. For convenience in implementing sets of sets, inner sets are automatically converted to immutable form, for example, Set([Set(['dog'])]) is transformed to Set([ImmutableSet(['dog'])]).

Two other major differences that others have pointed out are

  • Java sets are mutable
  • Java sets are finite. Obviously, this will be true of any collection class: apart from concerns about actual infinity, computers only have a finite amount of memory. (Some languages, like Haskell, have lazy infinite data structures; however, in my opinion, a lawlike choice sequence seems like a more natural way model these than classical set theory, but that's just my opinion).

TL;DR No, it really shouldn't be permitted (or, at least, you should never do that) because sets can't be members of themselves.

like image 10
EJoshuaS - Stand with Ukraine Avatar answered Oct 21 '22 13:10

EJoshuaS - Stand with Ukraine