I want to count the number of occurrences of each character in a large file. Although I know that counting should be implemented in a strict way in Haskell (which I tried to achieve using foldl'), I am still running out of memory. For comparison: the file size is about 2GB, while the computer has 100GB of memory. There are not many different characters in that file - maybe 20. What am I doing wrong?
import System.Environment (getArgs)
import Data.List (foldl')

ins :: [(Char, Int)] -> Char -> [(Char, Int)]
ins [] c = [(c, 1)]
ins ((c, i):cs) d
  | c == d    = (c, i + 1) : cs
  | otherwise = (c, i) : ins cs d

main :: IO ()
main = do
  [file] <- getArgs
  txt <- readFile file
  print $ foldl' ins [] txt
Your ins function is creating lots of thunks, and that is the source of the space leak. foldl' only evaluates the accumulator to weak head normal form, which is not enough here: the outermost list constructor is forced, but the (i+1) counters inside the pairs stay unevaluated and pile up. What you need is deepseq from Control.DeepSeq, which forces the accumulator all the way to normal form.
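For instance, here is a minimal sketch of that fix, keeping the question's ins unchanged and using force (the pure companion to deepseq from Control.DeepSeq) to bring the accumulator to normal form after every step:

```haskell
import System.Environment (getArgs)
import Data.List (foldl')
import Control.DeepSeq (force)

ins :: [(Char, Int)] -> Char -> [(Char, Int)]
ins [] c = [(c, 1)]
ins ((c, i):cs) d
  | c == d    = (c, i + 1) : cs
  | otherwise = (c, i) : ins cs d

main :: IO ()
main = do
  [file] <- getArgs
  txt <- readFile file
  -- force fully evaluates the association list (including every count)
  -- after each character, so no (i+1) thunks can accumulate.
  print $ foldl' (\acc c -> force (ins acc c)) [] txt
```

Because the list only ever holds ~20 pairs, forcing it completely on each step is cheap.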
Alternatively, instead of an association list, use Data.Map.Strict for counting. Also, if your input is on the order of 2GB, you are better off reading it as a lazy ByteString rather than a plain String.
The code below should run in constant memory regardless of the size of the input:
import System.Environment (getArgs)
import Data.Map.Strict (empty, alter)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = getArgs >>= B.readFile . head >>= print . B.foldl' go empty
  where
    go = flip $ alter inc

    inc :: Maybe Int -> Maybe Int
    inc Nothing  = Just 1
    inc (Just i) = Just $ i + 1
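For comparison, an equivalent formulation (a sketch, not part of the answer above) replaces alter with Data.Map.Strict's insertWith, which inserts 1 for a new key and otherwise combines with (+):

```haskell
import System.Environment (getArgs)
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
  [file] <- getArgs
  bytes <- B.readFile file
  -- insertWith (+) c 1: insert 1 if c is absent, otherwise add 1 to
  -- the existing count; Data.Map.Strict forces the stored value.
  print $ B.foldl' (\m c -> M.insertWith (+) c 1 m) M.empty bytes
```

Both versions avoid thunk buildup because the strict map evaluates values before storing them.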