Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Curious about the HashTable performance issues

I read that hash tables in Haskell had performance issues (on the Haskell-Cafe in 2006 and Flying Frog Consultancy's blog in 2009), and since I like Haskell it worried me.

That was a year ago, what is the status now (June 2010)? Has the "hash table problem" been fixed in GHC?

like image 354
Alessandro Stamatto Avatar asked Jun 17 '10 02:06

Alessandro Stamatto


People also ask

Why is Hashtable performance slow?

You are hitting memory paging limits. You have some dodgy constructors and destructors. You are making calls to it when you don't need to. Your hash function isn't right for your dataset, collapsing it into a list internally.

Why is Hashtable fast?

Searching over a data structure such as an array presents a linear time complexity of O(n). In other words, as the data structure increases in size, the search time increases in a linear fashion. Simply put, using a hash table is faster than searching through an array.

What is the purpose of a Hashtable?

Hash tables let us implement things like phone books or dictionaries; in them, we store the association between a value (like a dictionary definition of the word "lamp") and its key (the word "lamp" itself). We can use hash tables to store, retrieve, and delete data uniquely based on their unique key.

Why wouldn't you use a hash table?

There are some operations which are not efficiently supported by hash tables, such as iterating over all the elements whose keys are within a certain range, finding the element with the largest key or smallest key, and so on. The O(n) complexity is on average.


1 Answers

The problem was that the garbage collector is required to traverse mutable arrays of pointers ("boxed arrays") looking for pointers to data that might be ready to deallocate. Boxed, mutable arrays are the main mechanism for implementing a hashtable, so that particular structure showed up the GC traversal issue. This is common to many languages. The symptom is excessive garbage collection (up to 95% of time spent in GC).

The fix was to implement "card marking" in the GC for mutable arrays of pointers, which occured in late 2009. You shouldn't see excessive GC when using mutable arrays of pointers in Haskell now. On the simple benchmarks, hashtable insertion for large hashes improved by 10x.

Note that the GC walking issue doesn't affect purely functional structures, nor unboxed arrays (like most data parallel arrays, or vector-like arrays, in Haskell. Nor does it affect hashtables stored on the C heap (like judy). Meaning that it didn't affect day-to-day Haskellers not using imperative hash tables.

If you are using hashtables in Haskell, you shouldn't observe any issue now. Here, for example, is a simple hashtable program that inserts 10 million ints into a hash. I'll do the benchmarking, since the original citation doesn't present any code or benchmarks.

import Control.Monad import qualified Data.HashTable as H import System.Environment  main = do   [size] <- fmap (fmap read) getArgs   m <- H.new (==) H.hashInt   forM_ [1..size] $ \n -> H.insert m n n   v <- H.lookup m 100   print v 

With GHC 6.10.2, before the fix, inserting 10M ints:

$ time ./A 10000000 +RTS -s ... 47s. 

With GHC 6.13, after the fix:

./A 10000000 +RTS -s  ... 8s 

Increasing the default heap area:

./A +RTS -s -A2G ... 2.3s 

Avoiding hashtables and using an IntMap:

import Control.Monad import Data.List import qualified Data.IntMap as I import System.Environment  main = do   [size] <- fmap (fmap read) getArgs   let k = foldl' (\m n -> I.insert n n m) I.empty [1..size]   print $ I.lookup 100 k 

And we get:

$ time ./A 10000000 +RTS -s         ./A 10000000 +RTS -s 6s 

Or, alternatively, using a judy array (which is a Haskell wrapper calling C code through the foreign-function interface):

import Control.Monad import Data.List import System.Environment import qualified Data.Judy as J  main = do   [size] <- fmap (fmap read) getArgs   j <- J.new :: IO (J.JudyL Int)   forM_ [1..size] $ \n -> J.insert (fromIntegral n) n j   print =<< J.lookup 100 j 

Running this,

$ time ./A 10000000 +RTS -s ... 2.1s 

So, as you can see, the GC issue with hashtables is fixed, and there have always been other libraries and data structures which were perfectly suitable. In summary, this is a non-issue.

Note: as of 2013, you should probably just use the hashtables package, which supports a range of mutable hashtables natively.

like image 184
Don Stewart Avatar answered Sep 22 '22 18:09

Don Stewart