Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Data.Set require elements to be an instance of Ord?

Tags:

This doesn't work

data Cutlery = Knife | Fork deriving (Show,Eq)
let x = [Knife,Fork]
let set1 = Set.fromList x

while defining

data Cutlery = Knife | Fork deriving (Show,Ord,Eq)

solves the issue but doesn't make sense. Is Data.Set different than the mathematical definition of a set?

like image 916
Codey McCodeface Avatar asked Jan 02 '14 10:01

Codey McCodeface


2 Answers

A Data.Set captures the mathematical abstraction of a set, but it is not identical. The main difference is that a Data.Set requires its elements to be ordered, whereas a mathematical set only requires that its elements be comparable for equality.

The reason for requiring Ord is efficiency. It would be perfectly possible to build a set abstraction by defining

data Set a = Set [a]

i.e. under the hood it is just a list, and we make sure we never insert duplicate elements. The elem and insert operations would be

elem a (Set as) = any (a ==) as

insert a (Set as) | a `elem` as = Set as
                  | otherwise   = Set (a:as)

However, this means that both elem and insert are O(n) operations. If we want to do any better than this, the standard approaches are

  1. Store the elements in a balanced binary tree (which requires an Ord instance)
  2. Hash the elements and store them in an array (which requires a Hashable instance).

TreeSet

The implementation chosen by the authors of Data.Set was to use a binary tree, which you can see by going to the source. The implementation is more or less

data Set a = Bin a (Set a) (Set a)
           | Tip

Now you can write the elem function as

elem :: Ord a => a -> Set a -> Bool
elem = go
  where
    go _  Tip = False
    go x (Bin y l r) = case compare x y of
      LT -> go x l
      GT -> go x r
      EQ -> True

which is an O(log n) operation, rather than O(n). Insertions are trickier (as you need to keep the tree balanced) but similar.

HashSet

In a hash set, you don't directly compare elements when inserting and removing them. Instead, each element is hashed to an integer, and stored in a location based on that integer.

In theory this doesn't require an Ord instance. In practice, you need some method of keeping track of multiple elements that hash to the same value, and the method chosen by the developers of Data.HashSet is to store multiple elements in a regular Data.Set, so it turns out you do need the Ord instance after all!

data HashSet a = HashSet (Data.IntMap.IntMap (Data.Set.Set a))

It could have been written instead as

data HashSet a = HashSet (Data.IntMap.IntMap [a])

instead, which removes out the Ord requirement at the cost of some inefficiency if there are many elements which has to the same value.

like image 179
Chris Taylor Avatar answered Oct 25 '22 01:10

Chris Taylor


Is Data.Set different than the mathematical definition of a set?

Obviously, mathematical sets can be uncountable infinite – you won’t be able to represent that in all generality with a computer, or even a Turing machine.

But the answer you are looking for is this: Data.Set is a data type based on binary trees, and needs a total linear order on the elements to know whether to put and later find something on the left or right subtree of a node. So while it would be possible to implement a set datatype without an Ord constraint, this particular, more efficient implementation would not.

like image 30
Joachim Breitner Avatar answered Oct 25 '22 03:10

Joachim Breitner