Running out of memory while counting characters in a large file

I want to count the number of occurrences of each character in a large file. I know that the counting should be done strictly in Haskell (which I tried to achieve using foldl'), but I am still running out of memory. For comparison: the file is about 2GB, while the computer has 100GB of memory. There are not many distinct characters in the file - maybe 20. What am I doing wrong?

import Data.List (foldl')
import System.Environment (getArgs)

ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main = do
    [file] <- getArgs
    txt <- readFile file
    print $ foldl' ins [] txt
asked Dec 23 '16 by Dune

1 Answer

Your ins function is building up a huge chain of thunks, and that is what leaks memory. foldl' only evaluates the accumulator to weak head normal form, which is not enough here: it forces the outermost list constructor but leaves the counts inside the pairs as unevaluated thunks. To evaluate the accumulator all the way to normal form you need deepseq (or its pure wrapper force) from Control.DeepSeq.
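
For illustration, here is a minimal sketch of the original program with the accumulator forced to normal form on every step; the ins function and file handling are unchanged from the question:

import Control.DeepSeq (force)
import Data.List (foldl')
import System.Environment (getArgs)

ins :: [(Char,Int)] -> Char -> [(Char,Int)]
ins [] c = [(c,1)]
ins ((c,i):cs) d
    | c == d = (c,i+1):cs
    | otherwise = (c,i) : ins cs d

main :: IO ()
main = do
    [file] <- getArgs
    txt <- readFile file
    -- force evaluates the whole association list (pairs and counts)
    -- after each character, so no thunks accumulate between steps
    print $ foldl' (\acc c -> force (ins acc c)) [] txt

This only fixes the space leak: each character still triggers a linear scan of the (admittedly small) association list.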

Alternatively, instead of an association list, use Data.Map.Strict for the counting. Also, if your I/O is on the order of 2GB, you are better off reading a lazy ByteString than a plain String.

The code below should run in constant memory, regardless of the size of the input:

import System.Environment (getArgs)
import Data.Map.Strict (empty, alter)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = getArgs >>= B.readFile . head >>= print . B.foldl' go empty
  where
  -- update the count for one character in the map
  go = flip $ alter inc
  inc :: Maybe Int -> Maybe Int
  inc Nothing  = Just 1       -- first occurrence of the character
  inc (Just i) = Just $ i + 1 -- bump an existing count
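
As a side note, alter with a Maybe-valued update function is just one way to express the counting step; a sketch of an equivalent one-liner uses insertWith (also from Data.Map.Strict, so the import list above would need extending):

go m c = insertWith (+) c 1 m  -- insert 1, or add 1 to the existing count

With either version, compiling with optimizations (e.g. ghc -O2) and passing the file name as the first command-line argument should keep memory usage flat.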
answered Oct 19 '22 by behzad.nouri