So after squeezing the last bit of performance out of some Haskell I'm using to break tweet data into n-grams, I'm running up against a space leak problem. When I profile, GC accounts for about 60-70% of the run time and a significant portion of memory is attributed to drag. Hopefully some Haskell guru will be able to suggest where I'm going wrong.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
import Data.Maybe
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H
import Text.Regex.Posix
import Data.List
import qualified Data.Char as C

-- Characters that are kept when cleaning a tweet.
isClassChar a = C.isAlphaNum a || a == ' ' || a == '\'' ||
                a == '-' || a == '#' || a == '@' || a == '%'

-- Lower-case the input and strip everything outside the character class.
cullWord :: B.ByteString -> B.ByteString
cullWord w = B.map C.toLower $ B.filter isClassChar w

-- Count every n-gram in the text, one line at a time.
procTextN :: Int -> B.ByteString -> [([B.ByteString],Int)]
procTextN n t = H.toList $ foldl' ngram H.empty lines
  where
    !lines = B.lines $ cullWord t
    ngram tr line = snd $ foldl' breakdown (base,tr) (B.split ' ' line)
    base = replicate (n-1) ""

-- Slide the (n-1)-word "stack" forward by one word and record the new n-gram.
breakdown :: ([B.ByteString], H.HashMap [B.ByteString] Int) -> B.ByteString -> ([B.ByteString], H.HashMap [B.ByteString] Int)
breakdown (st@(s:ss),tree) word =
    newStack `seq` expandedWord `seq` (newStack,expandedWord)
  where
    newStack = ss ++ [word]
    expandedWord = updateWord (st ++ [word]) tree

-- Increment the count for one n-gram.
updateWord :: [B.ByteString] -> H.HashMap [B.ByteString] Int -> H.HashMap [B.ByteString] Int
updateWord w h = H.insertWith (+) w 1 h

main = do
    test2 <- B.readFile "canewobble"
    print $ filter (\(a,b) -> b > 100) $
            sortBy (\(a,b) (c,d) -> compare d b) $ procTextN 3 test2
One small optimisation is to filter the data (using HashMap.filter) before sorting it. This helped me shave 2s off the final execution time. Another thing I tried was to use sequences (Data.Sequence) instead of lists, with no noticeable difference :-( . My version can be found here.
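A minimal sketch of the filtering idea, against a toy map rather than the full program (the threshold, names and example data here are illustrative, not the linked version):

{-# LANGUAGE OverloadedStrings #-}
-- Sketch: prune rare n-grams with HashMap.filter while they are still in the
-- map, so sortBy only has to order the survivors.
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as H

frequent :: Int -> H.HashMap [B.ByteString] Int -> [([B.ByteString], Int)]
frequent threshold =
      sortBy (comparing (Down . snd))   -- sort only what survived the filter
    . H.toList
    . H.filter (> threshold)            -- filter inside the map, before toList

main :: IO ()
main = print $ frequent 1 $
    H.fromList [(["a","b"], 3), (["b","c"], 1), (["c","d"], 2)]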
Looking at the heap profile, I don't think that there is a space leak in your program:
You're just building a fairly large hash table (377141 key-value pairs) in memory and then discarding it after some processing. According to Johan's post, a hash table of this size takes approx. 5*N + 4*(N-1) words = 3394265*4 bytes ~= 13 MiB, which agrees with what the heap profile shows. The remaining space is taken by the keys and values. On my machine, time spent in GC is around 40%, which doesn't sound unreasonable given that you're constantly updating the hash table and the temporary "stacks" while not doing anything computation-heavy with the data. Since the only operation you need the hash table for is insertWith, perhaps it's better to use a mutable data structure?
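As a side note, a quick back-of-the-envelope check of that size estimate (my own arithmetic, assuming the 4-byte words used above):

-- Rough check of the 5*N + 4*(N-1) estimate for N = 377141 entries,
-- assuming 4-byte words as in the figures above.
main :: IO ()
main = do
    let n     = 377141               -- key-value pairs in the hash table
        ws    = 5 * n + 4 * (n - 1)  -- words for the HashMap spine
        bytes = ws * 4               -- 4-byte words
    putStrLn $ show ws ++ " words = " ++ show bytes ++ " bytes ~= 13 MiB"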
Update: I've rewritten your program using a mutable hash table. Interestingly, the speed difference is not large, but memory usage is slightly better:
As the heap profile shows, the size of the block allocated for the hash table stays constant throughout the execution.
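A minimal sketch of the mutable-hash-table approach, using Data.HashTable.IO from the hashtables package, could look like this (the helper names, window handling and toy example are illustrative, not the exact rewrite):

-- Sketch: count n-grams in place with a mutable hash table (hashtables package).
import qualified Data.ByteString.Char8 as B
import qualified Data.HashTable.IO as HT

type Counts = HT.BasicHashTable [B.ByteString] Int

-- Bump the count for one n-gram in place.
bump :: Counts -> [B.ByteString] -> IO ()
bump table key = do
    old <- HT.lookup table key
    HT.insert table key (maybe 1 (+1) old)

-- Count every length-n window in each tokenised line.
countNGrams :: Int -> [[B.ByteString]] -> IO Counts
countNGrams n tokenised = do
    table <- HT.new
    mapM_ (mapM_ (bump table) . windows) tokenised
    return table
  where
    windows ws
        | length ws < n = []
        | otherwise     = take n ws : windows (tail ws)

main :: IO ()
main = do
    table <- countNGrams 2 [B.words (B.pack "a rose is a rose")]
    HT.toList table >>= print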