Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fixing a particularly obscure Haskell space leak

So after having squeezed the last bit of performance out of some Haskell I am using to break tweet data into n-grams, I'm running up against a space leak problem. When I profile, the GC uses about 60-70% of the process and there are significant memory portions dedicated to drag. Hopefully, some Haskell guru will be able to suggest when I'm going wrong.

{-# LANGUAGE OverloadedStrings, BangPatterns #-}
import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H
import Text.Regex.Posix
import Data.List
import qualified Data.Char as C

isClassChar a = C.isAlphaNum a || a == ' ' || a == '\'' ||
                a == '-' || a == '#' || a == '@' || a == '%'

cullWord :: B.ByteString -> B.ByteString
cullWord w = B.map C.toLower $ B.filter isClassChar w

procTextN :: Int -> B.ByteString -> [([B.ByteString],Int)]
procTextN n t = H.toList $ foldl' ngram H.empty lines
  where !lines        = B.lines $ cullWord t
        ngram tr line = snd $ foldl' breakdown (base,tr) (B.split ' ' line)
        base          = replicate (n-1) ""

breakdown :: ([B.ByteString], H.HashMap [B.ByteString] Int) -> B.ByteString -> ([B.ByteString],H.HashMap [B.ByteString] Int)
breakdown (st@(s:ss),tree) word =
  newStack `seq` expandedWord `seq` (newStack,expandedWord)
      where newStack     = ss ++ [word]
            expandedWord = updateWord (st ++ [word]) tree

updateWord :: [B.ByteString] -> H.HashMap [B.ByteString] Int -> H.HashMap [B.ByteString] Int
updateWord w h = H.insertWith (+) w 1 h

main = do
  test2 <- B.readFile "canewobble"
  print $ filter (\(a,b) -> b > 100) $ 
     sortBy (\(a,b) (c,d) -> compare d b) $ procTextN 3 test2
like image 246
Erik Hinton Avatar asked Oct 21 '11 21:10

Erik Hinton


1 Answers

One small optimisation is to filter the data (using HashMap.filter) before sorting it. This helped me shave 2s off the final execution time. Another thing I've done was to use sequences (Data.Sequence) instead of lists (no noticeable difference :-( ). My version can be found here.

Looking at the heap profile, I don't think that there is a space leak in your program: Space profile

You're just building a fairly large hash table (377141 key-value pairs) in memory, and then discard it after some processing. According to Johan's post, a hash table of this size takes approx. 5*N + 4*(N-1) words = 3394265*4 bytes ~= 13 MiB, which agrees with what the heap profile shows. The remaining space is taken by the keys and values. On my machine, time spent in GC is around 40%, which doesn't sound unreasonable given that you're constantly updating the hash table and the temporary "stacks", while not doing anything computation-heavy with the data. Since the only operation that you need the hash table for is insertWith, perhaps it's better to use a mutable data structure?

Update: I've rewritten your program using a mutable hash table. Interestingly, the speed difference is not large, but memory usage is slightly better:

enter image description here

As you can see, the size of the block allocated for the hash table stays constant throughout the execution.

like image 102
Mikhail Glushenkov Avatar answered Oct 28 '22 18:10

Mikhail Glushenkov