I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the counts of each word as I traverse the file.
The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file!
The code is shown below. Please note that I tried the following:

- Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better, in the sense that its memory consumption grows more slowly.
- Reading the file as a lazy ByteString instead of lazy Text, decoding it to Text for the processing, and encoding it back to ByteString for IO.
import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

-- Pair every word with a count of 1, then merge duplicates by summing.
dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main :: IO ()
main = do
    [file,out] <- getArgs
    h  <- openFile file ReadMode
    hO <- openFile out WriteMode
    mapM_ (flip hSetEncoding utf8) [h,hO]
    txt <- TI.hGetContents h
    -- Emit one "word<TAB>count" line per map entry.
    TI.hPutStr hO . T.unlines .
        map (uncurry ((. cons '\t' . pack . show) . append)) .
        toList . dictionate . T.words $ txt
    hFlush hO
    mapM_ hClose [h,hO]
    print "success"
What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?
This memory usage is expected. A Data.Map.Map consumes about 6N words of memory plus the size of the keys and values (figures taken from this excellent post by Johan Tibell). A lazy Text value of k characters takes up 7 words plus 2*k bytes (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes, and we will also assume that the average word in the input is 8 characters long.
Taking all of this into account, for N distinct words the memory usage comes to 6*N + 7*N + 2*N + 2*N = 17*N words (6 for the map node, 7 for the Text header, 2 for the 16-byte payload of an 8-character key, and 2 for the Word16), i.e. about 136 bytes per map entry.
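As a quick sanity check of the arithmetic, here is a small standalone program that reproduces the estimate; the 8-byte machine word, the 8-character average word length, and the 17-words-per-entry figure are exactly the assumptions stated above.

-- Back-of-the-envelope estimate of the worst-case Data.Map size,
-- under the assumptions from the text (64-bit machine, 8-character words).
wordSize, wordsPerEntry, bytesPerEntry :: Int
wordSize      = 8                         -- bytes per machine word
wordsPerEntry = 6 + 7 + 2 + 2             -- map node + Text header + Text payload + Word16
bytesPerEntry = wordsPerEntry * wordSize  -- 136 bytes per distinct word

-- Worst case: every 8-byte word in a 6 GiB file is unique.
worstCaseEntries :: Int
worstCaseEntries = (6 * 1024 ^ 3) `div` 8   -- roughly 800 million

worstCaseGiB :: Double
worstCaseGiB = fromIntegral (worstCaseEntries * bytesPerEntry) / 1024 ^ 3

main :: IO ()
main = print worstCaseGiB   -- prints 102.0

Running it prints 102.0, matching the worst-case figure derived below.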
In the worst case, all the words will be different, and there will be about (6 * 1024^3) / 8 ~= 800 * 10^6 of them. Plugging that into the formula above, we get a worst-case map size of approx. 102 GiB, which seems to agree with the experimental results. Solving the equation in the reverse direction tells us that your file contains about 200 * 10^6 different words.
As for alternative approaches to this problem, consider using a trie (as suggested by J. Abrahamson in the comments) or an approximate method, such as a count-min sketch.
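If you want to try the count-min sketch route, here is a minimal illustrative sketch of the idea; the table dimensions, the FNV-1a-style hash, and the helper names (newSketch, addWord, queryWord) are all invented for this example rather than taken from a library. The estimates can only over-count, and memory stays at a fixed width * depth table of counters no matter how many distinct words the file contains.

import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector.Unboxed.Mutable as VM
import Control.Monad (forM, forM_)
import Data.Bits (xor)
import Data.Word (Word64)

-- Fixed table: depth rows of width counters each (about 32 MiB here).
width, depth :: Int
width = 2 ^ 20   -- counters per row
depth = 4        -- rows, one per hash seed

-- A simple FNV-1a-style hash, seeded per row (illustrative, not tuned).
hashWith :: Int -> L.ByteString -> Int
hashWith seed s = fromIntegral (L.foldl' step basis s) `mod` width
  where
    basis = 0xcbf29ce484222325 + fromIntegral seed :: Word64
    step h c = (h `xor` fromIntegral (fromEnum c)) * 0x100000001b3

newSketch :: IO (VM.IOVector Int)
newSketch = VM.replicate (width * depth) 0

-- Bump one counter per row for the given word.
addWord :: VM.IOVector Int -> L.ByteString -> IO ()
addWord table w = forM_ [0 .. depth - 1] $ \row ->
  VM.modify table (+ 1) (row * width + hashWith row w)

-- The estimate is the minimum over the rows; it never under-counts.
queryWord :: VM.IOVector Int -> L.ByteString -> IO Int
queryWord table w = minimum <$> forM [0 .. depth - 1] (\row ->
  VM.read table (row * width + hashWith row w))

main :: IO ()
main = do
  table <- newSketch
  ws <- L.words <$> L.getContents          -- lazily stream words from stdin
  mapM_ (addWord table) ws
  queryWord table (L.pack "the") >>= print -- approximate count of "the"

Note the trade-off: a sketch answers "how often did word X occur?" within a small error, but it cannot enumerate the distinct words, so it fits point lookups rather than dumping the whole frequency table the way the Data.Map version does.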