I have been coding a histogram, and I have had some great help on here. I am using a hash table to store the keys and frequency values because the distribution of the keys is unknown; they might not be sorted or consecutive.
The problem with my code is that it spends too much time in GC (60.3%), which looks like a space leak; my productivity is a poor 39.7%.
What is going wrong? I have tried making things strict in the histogram function, and I have also inlined it (GC time went from 69.1% to 59.4%).
Please note that I have simplified this code by not updating the frequencies in the hash table; a sketch of the full update step follows the code.
{-# LANGUAGE BangPatterns #-}

import qualified Data.HashTable.IO as H
import qualified Data.Vector as V

type HashTable k v = H.BasicHashTable k v

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec !n = do
    ht <- H.newSized n
    go ht (n - 1)
  where
    go ht = go'
      where
        go' (-1) = return ht
        go' !i = do
            let (k, v) = vec V.! i
            H.insert ht k v
            go' (i - 1)
{-# INLINE histogram #-}

main :: IO ()
main = do
    ht <- histogram kv n
    putStrLn "done"
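For reference, the elided update step would replace the plain H.insert with a read-modify-write. A minimal sketch, using only H.lookup and H.insert from Data.HashTable.IO (bump is a hypothetical helper name, not part of the code above):

-- Hypothetical helper: add v to any existing count for key k.
bump :: H.CuckooHashTable Int Int -> Int -> Int -> IO ()
bump ht k v = do
    mv <- H.lookup ht k
    case mv of
        Nothing  -> H.insert ht k v
        Just old -> H.insert ht k (old + v)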
Here's how it is compiled:
ghc --make -O3 -fllvm -rtsopts histogram.hs
Diagnosis:
jap@devbox:~/dev$ ./histogram +RTS -sstderr
done
863,187,472 bytes allocated in the heap
708,960,048 bytes copied during GC
410,476,592 bytes maximum residency (5 sample(s))
4,791,736 bytes maximum slop
613 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284 colls, 0 par 0.46s 0.46s 0.0004s 0.0322s
Gen 1 5 colls, 0 par 0.36s 0.36s 0.0730s 0.2053s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.51s ( 0.50s elapsed)
GC time 0.82s ( 0.82s elapsed)
EXIT time 0.03s ( 0.04s elapsed)
Total time 1.36s ( 1.36s elapsed)
%GC time 60.3% (60.4% elapsed)
Alloc rate 1,708,131,822 bytes per MUT second
Productivity 39.7% of total user, 39.7% of total elapsed
For the sake of comparison, this is what I get running your code as posted:
863,187,472 bytes allocated in the heap
708,960,048 bytes copied during GC
410,476,592 bytes maximum residency (5 sample(s))
4,791,736 bytes maximum slop
613 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284 colls, 0 par 1.01s 1.01s 0.0008s 0.0766s
Gen 1 5 colls, 0 par 0.81s 0.81s 0.1626s 0.4783s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.04s elapsed)
GC time 1.82s ( 1.82s elapsed)
EXIT time 0.04s ( 0.04s elapsed)
Total time 2.91s ( 2.91s elapsed)
%GC time 62.6% (62.6% elapsed)
Alloc rate 827,493,210 bytes per MUT second
Productivity 37.4% of total user, 37.4% of total elapsed
Given that your vector elements are just (Int, Int) tuples, we have no reason not to use Data.Vector.Unboxed instead of plain Data.Vector.
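The only source change this step needs is the import line, since the unboxed module exposes the same zip/generate/indexing API used here (and (Int, Int) has the required Unbox instance):

import qualified Data.Vector.Unboxed as V   -- was: import qualified Data.Vector as V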
That already leads to a significant improvement:
743,148,592 bytes allocated in the heap
38,440 bytes copied during GC
231,096,768 bytes maximum residency (4 sample(s))
4,759,104 bytes maximum slop
226 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 977 colls, 0 par 0.23s 0.23s 0.0002s 0.0479s
Gen 1 4 colls, 0 par 0.22s 0.22s 0.0543s 0.1080s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.04s elapsed)
GC time 0.45s ( 0.45s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.49s ( 1.49s elapsed)
%GC time 30.2% (30.2% elapsed)
Alloc rate 715,050,070 bytes per MUT second
Productivity 69.8% of total user, 69.9% of total elapsed
Next, instead of hand-rolling recursion over the vector, we can use the optimised traversal functions the vector library provides for that purpose. Code...
import qualified Data.HashTable.IO as H
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec n = do
    ht <- H.newSized n
    V.mapM_ (\(k, v) -> H.insert ht k v) vec
    return ht
{-# INLINE histogram #-}

main :: IO ()
main = do
    ht <- histogram kv n
    putStrLn "done"
... and result:
583,151,048 bytes allocated in the heap
35,632 bytes copied during GC
151,096,672 bytes maximum residency (3 sample(s))
3,003,040 bytes maximum slop
148 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 826 colls, 0 par 0.20s 0.20s 0.0002s 0.0423s
Gen 1 3 colls, 0 par 0.12s 0.12s 0.0411s 0.1222s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.92s ( 0.92s elapsed)
GC time 0.32s ( 0.33s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.25s ( 1.25s elapsed)
%GC time 25.9% (26.0% elapsed)
Alloc rate 631,677,209 bytes per MUT second
Productivity 74.1% of total user, 74.0% of total elapsed
Nearly 80 MB saved; not bad at all. Can we do even better?
A heap profile (which should be the first thing you reach for when you have memory consumption woes; debugging them without one is shooting in the dark) reveals that, even with the original code, peak memory consumption happens very early on. Strictly speaking we do not have a leak; we simply use a lot of memory from the beginning.
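For reference, one way to obtain such a profile, assuming the binary was linked with -rtsopts as above (-hT breaks the heap down by closure type and works even without a profiling build):

./histogram +RTS -hT -RTS
hp2ps -c histogram.hp    # render histogram.hp to histogram.ps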
Now, note that the hash table is created with ht <- H.newSized n, with n = 5000000. Unless you expect to have that many different keys (as opposed to elements), this is extremely wasteful. Changing the initial size to 10 (the number of distinct keys you actually have in your test) improves things dramatically:
432,059,960 bytes allocated in the heap
50,200 bytes copied during GC
44,416 bytes maximum residency (2 sample(s))
25,216 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 825 colls, 0 par 0.01s 0.01s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0003s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.90s ( 0.90s elapsed)
GC time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.91s ( 0.90s elapsed)
%GC time 0.6% (0.6% elapsed)
Alloc rate 481,061,802 bytes per MUT second
Productivity 99.4% of total user, 99.4% of total elapsed
Finally, we might as well make our life simpler and use the pure, yet efficient, hash map from unordered-containers. Code...
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> M.HashMap Int Int
histogram vec = V.foldl' (\ht (k, v) -> M.insert k v ht) M.empty vec

main :: IO ()
main = do
    print $ M.size $ histogram kv
    putStrLn "done"
... and result.
55,760 bytes allocated in the heap
3,512 bytes copied during GC
44,416 bytes maximum residency (1 sample(s))
17,024 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.34s ( 0.34s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.34s ( 0.34s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 162,667 bytes per MUT second
Productivity 99.9% of total user, 100.0% of total elapsed
~60% faster. It remains to be seen how this would scale with a larger number of keys, but with your test data unordered-containers ends up being not only more convenient (pure functions; actually updating the histogram values only takes changing M.insert to M.insertWith) but also faster.
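To spell that out, a minimal sketch of the frequency-accumulating version, a drop-in replacement for the histogram above:

-- M.insertWith (+) sums the counts of duplicate keys
-- instead of overwriting them.
histogram :: V.Vector (Int,Int) -> M.HashMap Int Int
histogram vec = V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec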