Haskell space leak in hash table insertion

I have been coding a histogram, and I have had some great help on here already. My histogram uses a hash table to store the keys and frequency values, because the distribution of the keys is unknown; they might not be sorted or consecutive.

The problem with my code is that it spends too much time in GC, which looks like a space leak: the time spent in GC is 60.3%, so my productivity is a poor 39.7%.

What is going wrong? I have tried to make things strict in the histogram function, and I have also inlined it (GC time went from 69.1% to 59.4%).

Please note that I have simplified this code by not updating the frequencies in the hash table.
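For reference, the full version's insert would first look the key up and add to any existing count; a minimal sketch, assuming the imports from the listing below (bumpKey is a hypothetical helper, not part of my actual code):

bumpKey :: H.CuckooHashTable Int Int -> Int -> Int -> IO ()
bumpKey ht k v = do
    mv <- H.lookup ht k
    H.insert ht k (maybe v (+ v) mv)  -- start at v, or add v to the old count

Here is the simplified code: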

{-# LANGUAGE BangPatterns #-}
import qualified Data.HashTable.IO as H
import qualified Data.Vector as V

type HashTable k v = H.BasicHashTable k v  -- (note: unused below; histogram returns a CuckooHashTable)

n :: Int 
n = 5000000

-- Five million (key, value) pairs: keys cycle through 0..9, all values are 1.
kv :: V.Vector (Int,Int)
kv = V.zip k v 
 where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

-- Insert every pair, walking the vector from index n-1 down to 0.
histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec !n = do
    ht <- H.newSized n 
    go ht (n-1)
        where
            go ht = go'
                where 
                    go' (-1) = return ht
                    go' !i = do
                        let (k,v) = vec V.! i
                        H.insert ht k v
                        go' (i-1)
{-# INLINE histogram #-}

main :: IO ()
main = do
    ht <- histogram kv n
    putStrLn "done"

Here's how it is compiled:

ghc --make -O3 -fllvm -rtsopts histogram.hs

Diagnosis:

jap@devbox:~/dev$ ./histogram +RTS -sstderr
done
     863,187,472 bytes allocated in the heap
     708,960,048 bytes copied during GC
     410,476,592 bytes maximum residency (5 sample(s))
       4,791,736 bytes maximum slop
             613 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1284 colls,     0 par    0.46s    0.46s     0.0004s    0.0322s
  Gen  1         5 colls,     0 par    0.36s    0.36s     0.0730s    0.2053s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.51s  (  0.50s elapsed)
  GC      time    0.82s  (  0.82s elapsed)
  EXIT    time    0.03s  (  0.04s elapsed)
  Total   time    1.36s  (  1.36s elapsed)

  %GC     time      60.3%  (60.4% elapsed)

  Alloc rate    1,708,131,822 bytes per MUT second

  Productivity  39.7% of total user, 39.7% of total elapsed
asked Apr 18 '14 by jap

1 Answer

For the sake of comparison, this is what I get running your code as posted:

     863,187,472 bytes allocated in the heap
     708,960,048 bytes copied during GC
     410,476,592 bytes maximum residency (5 sample(s))
       4,791,736 bytes maximum slop
             613 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1284 colls,     0 par    1.01s    1.01s     0.0008s    0.0766s
  Gen  1         5 colls,     0 par    0.81s    0.81s     0.1626s    0.4783s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.04s  (  1.04s elapsed)
  GC      time    1.82s  (  1.82s elapsed)
  EXIT    time    0.04s  (  0.04s elapsed)
  Total   time    2.91s  (  2.91s elapsed)

  %GC     time      62.6%  (62.6% elapsed)

  Alloc rate    827,493,210 bytes per MUT second

  Productivity  37.4% of total user, 37.4% of total elapsed

Given that your vector elements are just (Int, Int) tuples, we have no reason not to use Data.Vector.Unboxed instead of plain Data.Vector.
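Unboxed vectors store the Int pairs flat, with no per-element heap boxes for the GC to trace, and (Int, Int) already has the required Unbox instance. The only source change at this point is the import (the full listing appears in the next step):

import qualified Data.Vector.Unboxed as V

That alone already leads to a significant improvement: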

     743,148,592 bytes allocated in the heap
          38,440 bytes copied during GC
     231,096,768 bytes maximum residency (4 sample(s))
       4,759,104 bytes maximum slop
             226 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       977 colls,     0 par    0.23s    0.23s     0.0002s    0.0479s
  Gen  1         4 colls,     0 par    0.22s    0.22s     0.0543s    0.1080s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.04s  (  1.04s elapsed)
  GC      time    0.45s  (  0.45s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.49s  (  1.49s elapsed)

  %GC     time      30.2%  (30.2% elapsed)

  Alloc rate    715,050,070 bytes per MUT second

  Productivity  69.8% of total user, 69.9% of total elapsed

Next, instead of hand-rolling recursion over the vector, we might use the optimised functions the vector library provides for that purpose. Code...

import qualified Data.HashTable.IO as H
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
 where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec n = do
    ht <- H.newSized n
    -- let vector's optimised mapM_ drive the traversal instead of manual recursion
    V.mapM_ (\(k, v) -> H.insert ht k v) vec
    return ht
{-# INLINE histogram #-}

main :: IO ()
main = do
    ht <- histogram kv n
    putStrLn "done"

... and result:

     583,151,048 bytes allocated in the heap
          35,632 bytes copied during GC
     151,096,672 bytes maximum residency (3 sample(s))
       3,003,040 bytes maximum slop
             148 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       826 colls,     0 par    0.20s    0.20s     0.0002s    0.0423s
  Gen  1         3 colls,     0 par    0.12s    0.12s     0.0411s    0.1222s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.92s  (  0.92s elapsed)
  GC      time    0.32s  (  0.33s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.25s  (  1.25s elapsed)

  %GC     time      25.9%  (26.0% elapsed)

  Alloc rate    631,677,209 bytes per MUT second

  Productivity  74.1% of total user, 74.0% of total elapsed

Nearly 80MB of memory saved, not bad at all. Can we do even better?

A heap profile (which should be the first thing you think of when you have memory consumption woes; debugging them without one is shooting in the dark) reveals that, even with the original code, peak memory consumption happens very early on. Strictly speaking we do not have a leak; we just spend a lot of memory from the beginning. Now, note that the hash table is created with ht <- H.newSized n, with n = 5000000. Unless you expect that many distinct keys (as opposed to elements), that is extremely wasteful. Changing the initial size to 10 (the number of distinct keys you actually have in your test) improves things dramatically.
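(One way to get such a profile without even compiling for profiling is the RTS's closure-type heap profile: run the program with +RTS -hT and render the resulting histogram.hp file with hp2ps.) The fix itself is a one-line change to histogram; a sketch, with the now-unused size parameter ignored:

histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec _ = do
    ht <- H.newSized 10  -- sized for the 10 distinct keys, not the 5000000 elements
    V.mapM_ (\(k, v) -> H.insert ht k v) vec
    return ht

With that change, the statistics become: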

     432,059,960 bytes allocated in the heap
          50,200 bytes copied during GC
          44,416 bytes maximum residency (2 sample(s))
          25,216 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       825 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.90s  (  0.90s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.91s  (  0.90s elapsed)

  %GC     time       0.6%  (0.6% elapsed)

  Alloc rate    481,061,802 bytes per MUT second

  Productivity  99.4% of total user, 99.4% of total elapsed

Finally, we might as well make our life simpler and try using the pure, yet efficient, hash map from unordered-containers. Code...

import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
 where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

-- Pure strict left fold over the unboxed vector, building the map directly.
histogram :: V.Vector (Int,Int) -> M.HashMap Int Int
histogram vec =
    V.foldl' (\ht (k, v) -> M.insert k v ht) M.empty vec

main :: IO ()
main = do
    print $ M.size $ histogram kv
    putStrLn "done"

... and result.

          55,760 bytes allocated in the heap
           3,512 bytes copied during GC
          44,416 bytes maximum residency (1 sample(s))
          17,024 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.34s  (  0.34s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.34s  (  0.34s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    162,667 bytes per MUT second

  Productivity  99.9% of total user, 100.0% of total elapsed

That is roughly 60% faster overall. It remains to be seen how this would scale with a larger number of keys, but with your test data unordered-containers ends up being not only more convenient (pure functions; actually updating the histogram frequencies only takes changing M.insert to M.insertWith, as sketched below) but also faster.
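For instance, the frequency-updating fold would look like this (a sketch; same imports as the listing above):

histogram :: V.Vector (Int,Int) -> M.HashMap Int Int
histogram vec =
    -- insertWith (+) sums the new value into any existing count for the key
    V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec

(Equivalently, M.fromListWith (+) applied to V.toList vec would do the same job.)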

answered Nov 16 '22 by duplode