I read that hash tables in Haskell had performance issues (on the Haskell-Cafe in 2006 and Flying Frog Consultancy's blog in 2009), and since I like Haskell it worried me. That was a year ago. What is the status now (June 2010)? Has the "hash table problem" been fixed in GHC?


Curious about the HashTable performance issues

1 Answer

The problem was that the garbage collector is required to traverse mutable arrays of pointers ("boxed arrays") looking for pointers to data that might be ready to deallocate. Boxed, mutable arrays are the main mechanism for implementing a hashtable, so that particular structure showed up the GC traversal issue. This is common to many languages. The symptom is excessive garbage collection (up to 95% of time spent in GC).

The fix was to implement "card marking" in the GC for mutable arrays of pointers, which occurred in late 2009. You shouldn't see excessive GC when using mutable arrays of pointers in Haskell now. On the simple benchmarks, hashtable insertion for large hashes improved by 10x.
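To make the idea behind card marking concrete, here is a toy model in Haskell. It is only a sketch of the general technique, not GHC's actual implementation: the card size of 128 slots and the names `cardOf`, `markWrite` and `slotsToScan` are assumptions made up for illustration. The array is divided into fixed-size "cards", each write marks its card dirty, and the collector then only needs to rescan the dirty cards rather than the whole array.

```haskell
import qualified Data.IntSet as S

-- Number of array slots covered by one card (an illustrative choice).
cardSize :: Int
cardSize = 128

-- The card that a given array index falls into.
cardOf :: Int -> Int
cardOf i = i `div` cardSize

-- Record a write to slot i by marking its card dirty.
markWrite :: Int -> S.IntSet -> S.IntSet
markWrite i = S.insert (cardOf i)

-- Number of slots a collector would rescan: only those in dirty cards.
-- Assumes every marked index was within the array bounds.
slotsToScan :: Int -> S.IntSet -> Int
slotsToScan arrayLen dirty = sum [cardLen c | c <- S.toList dirty]
  where
    -- The last card may be shorter than cardSize.
    cardLen c = min cardSize (arrayLen - c * cardSize)

main :: IO ()
main = do
  let arrayLen = 10000000
      dirty    = foldr markWrite S.empty [5, 6, 7, 1000000]
  putStrLn ("without card marking: " ++ show arrayLen ++ " slots")
  putStrLn ("with card marking:    " ++ show (slotsToScan arrayLen dirty) ++ " slots")
```

In this model, four writes touching two cards leave 256 slots to rescan instead of 10 million, which is the kind of saving that matters for a large, sparsely updated hashtable.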

Note that the GC walking issue doesn't affect purely functional structures, nor unboxed arrays (like most data parallel arrays, or vector-like arrays, in Haskell). Nor does it affect hashtables stored on the C heap (like judy). That means it didn't affect day-to-day Haskellers who weren't using imperative hash tables.
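As a rough illustration of the boxed/unboxed distinction, the sketch below fills one array of each kind using `Data.Array.IO` from the `array` package that ships with GHC. The helper names `fillBoxed` and `fillUnboxed` are invented for this example. An `IOArray` holds pointers to heap objects, so it is the kind of structure the collector has to look inside; an `IOUArray` holds raw machine values, so there are no pointers to follow.

```haskell
import Control.Monad (forM_)
import Data.Array.IO (IOArray, IOUArray, newArray, readArray, writeArray)

-- Boxed mutable array: each slot is a pointer to a heap-allocated Integer.
fillBoxed :: Int -> IO (IOArray Int Integer)
fillBoxed n = do
  arr <- newArray (0, n - 1) 0
  forM_ [0 .. n - 1] $ \i -> writeArray arr i (fromIntegral i)
  return arr

-- Unboxed mutable array: each slot stores a raw Int, with no pointers inside.
fillUnboxed :: Int -> IO (IOUArray Int Int)
fillUnboxed n = do
  arr <- newArray (0, n - 1) 0
  forM_ [0 .. n - 1] $ \i -> writeArray arr i i
  return arr

main :: IO ()
main = do
  b <- fillBoxed 1000
  u <- fillUnboxed 1000
  x <- readArray b 100
  y <- readArray u 100
  print (x, y)
```

Both arrays expose the same `MArray` interface, so switching to the unboxed form is often just a type change when the element type allows it.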

If you are using hashtables in Haskell, you shouldn't observe any issue now. Here, for example, is a simple hashtable program that inserts 10 million ints into a hash. I'll do the benchmarking, since the original citation doesn't present any code or benchmarks.

    import Control.Monad
    import qualified Data.HashTable as H
    import System.Environment

    main = do
      [size] <- fmap (fmap read) getArgs
      m <- H.new (==) H.hashInt
      forM_ [1..size] $ \n -> H.insert m n n
      v <- H.lookup m 100
      print v

With GHC 6.10.2, before the fix, inserting 10M ints:

    $ time ./A 10000000 +RTS -s
    ...
    47s.

With GHC 6.13, after the fix:

    ./A 10000000 +RTS -s
    ...
    8s

Increasing the default heap area:

    ./A +RTS -s -A2G
    ...
    2.3s

Avoiding hashtables and using an IntMap:

    import Control.Monad
    import Data.List
    import qualified Data.IntMap as I
    import System.Environment

    main = do
      [size] <- fmap (fmap read) getArgs
      let k = foldl' (\m n -> I.insert n n m) I.empty [1..size]
      print $ I.lookup 100 k

And we get:

    $ time ./A 10000000 +RTS -s
    ./A 10000000 +RTS -s
    6s

Or, alternatively, using a judy array (which is a Haskell wrapper calling C code through the foreign-function interface):

    import Control.Monad
    import Data.List
    import System.Environment
    import qualified Data.Judy as J

    main = do
      [size] <- fmap (fmap read) getArgs
      j <- J.new :: IO (J.JudyL Int)
      forM_ [1..size] $ \n -> J.insert (fromIntegral n) n j
      print =<< J.lookup 100 j

Running this,

    $ time ./A 10000000 +RTS -s
    ...
    2.1s

So, as you can see, the GC issue with hashtables is fixed, and there have always been other libraries and data structures which were perfectly suitable. In summary, this is a non-issue.

Note: as of 2013, you should probably just use the hashtables package, which supports a range of mutable hashtables natively.
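For reference, the first benchmark might look roughly like this when ported to the hashtables package. This is a sketch rather than a measured benchmark: it assumes the `hashtables` package is installed, picks `BasicHashTable` as one of the implementations the package offers, and uses a made-up helper `fillTable` with a fixed size instead of reading the command line.

```haskell
import Control.Monad (forM_)
import qualified Data.HashTable.IO as H

-- One of several table implementations provided by the package.
type HashTable k v = H.BasicHashTable k v

-- Build a table mapping each n in [1..size] to itself.
fillTable :: Int -> IO (HashTable Int Int)
fillTable size = do
  m <- H.new
  forM_ [1 .. size] $ \n -> H.insert m n n
  return m

main :: IO ()
main = do
  m <- fillTable 1000000
  v <- H.lookup m 100
  print v
```

Unlike the old `Data.HashTable` API, no equality or hash function is passed to `H.new`; keys are hashed through their `Hashable` instance.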


answered Sep 22 '22 18:09

Don Stewart


Tags:

hashtable

haskell

ghc

Alessandro Stamatto
