 

Processing a very large text file with lazy Texts and ByteStrings

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the counts of each word as I traverse the file. The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file! The code is shown below. Please note that I tried the following:

  • Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better here, with memory consumption growing more slowly.

  • Reading the file with lazy ByteString instead of lazy Text, then decoding it to Text, doing the processing, and encoding the result back to ByteString for I/O (a sketch of this variant appears after the code below).

    import Data.Text.Lazy (Text(..), cons, pack, append)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as TI
    import Data.Map.Strict hiding (foldr, map, foldl')
    import System.Environment
    import System.IO
    import Data.Word
    
    -- Pair every word with 1 and sum the duplicates into a frequency map.
    dictionate :: [Text] -> Map Text Word16
    dictionate = fromListWith (+) . (`zip` [1,1..])
    
    main = do
        [file,out] <- getArgs
        h <- openFile file ReadMode
        hO <- openFile out WriteMode
        mapM_ (flip hSetEncoding utf8) [h,hO]
        txt <- TI.hGetContents h
        TI.hPutStr hO . T.unlines .
          -- format each (word, count) pair as "word<TAB>count"
          map (uncurry ((. cons '\t' . pack . show) . append)) .
          toList . dictionate . T.words $ txt
        hFlush hO
        mapM_ hClose [h,hO]
        print "success"
    
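For reference, a minimal sketch of the ByteString variant from the second bullet is shown below. It uses decodeUtf8/encodeUtf8 from Data.Text.Lazy.Encoding and an Int-valued map; countWords and render are names made up for this sketch, and it mirrors rather than reproduces the code above.

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.Encoding as TE
    import Data.Map.Strict (Map, fromListWith, toList)
    import System.Environment (getArgs)

    -- Count occurrences of each word; Int avoids the overflow risk of Word16.
    countWords :: T.Text -> Map T.Text Int
    countWords = fromListWith (+) . map (\w -> (w, 1)) . T.words

    main :: IO ()
    main = do
        [file, out] <- getArgs
        -- Read raw bytes and decode lazily; decodeUtf8 throws on invalid
        -- UTF-8 (decodeUtf8With lenientDecode is a more forgiving option).
        bytes <- BL.readFile file
        let render (w, n) = w `T.append` T.cons '\t' (T.pack (show n))
        BL.writeFile out
            . TE.encodeUtf8 . T.unlines . map render . toList
            . countWords . TE.decodeUtf8 $ bytes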

What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?

asked Nov 04 '13 by haskelline
1 Answer

This memory usage is expected. Data.Map.Map consumes about 6N words of memory + the size of the keys & values (figures taken from this excellent post by Johan Tibell). A lazy Text value takes up 7 words plus 2 bytes per character (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes. We will also assume that the average word in the input is 8 characters long.

Taking all of this into account, for N distinct words the memory usage comes to 6*N + 7*N + 2*N + 2*N = 17*N words, i.e. about 136*N bytes.
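As a back-of-the-envelope check, the per-entry cost works out as follows (a sketch only; the constants are the word counts quoted above, not measured values):

    -- Per-entry cost, assuming a 64-bit machine and 8-character average keys.
    wordSize, wordsPerEntry, bytesPerEntry :: Int
    wordSize      = 8                 -- bytes per machine word
    wordsPerEntry = 6 + 7 + 2 + 2     -- Map overhead + Text header + Text payload (8 chars * 2 B) + Word16 value
    bytesPerEntry = wordsPerEntry * wordSize   -- 17 * 8 = 136 bytes per distinct word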

In the worst case, all words are distinct and there are about (6 * 1024^3)/8 ~= 800 * 10^6 of them. Plugging that into the formula above gives a worst-case map size of approx. 102 GiB, which is an upper bound consistent with the observed memory use. Solving the equation in the reverse direction tells us that your file contains about 200 * 10^6 distinct words.
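Continuing the same rough arithmetic (approximate figures only):

    -- Worst case: the whole 6 GiB file consists of distinct 8-byte words.
    worstCaseEntries :: Double
    worstCaseEntries = 6 * 1024^3 / 8               -- ~8.0e8 distinct words

    worstCaseGiB :: Double
    worstCaseGiB = worstCaseEntries * 136 / 1024^3  -- ~102 GiB

    -- Reverse direction: ~2e8 distinct words * 136 bytes ~ 27 GB of map,
    -- which is in line with the observed 20 GB+ of memory use.
    estimatedMapGB :: Double
    estimatedMapGB = 2e8 * 136 / 1e9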

As for alternative approaches to this problem, consider using a trie (as suggested by J. Abrahamson in the comments) or an approximate method, such as a count-min sketch.
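For illustration only, a minimal character-trie sketch using just the containers and text packages from the question could look like the code below. The names Trie, insertWord, lookupCount and countAll are made up for this example; a production version would want a more compact packed trie (e.g. the bytestring-trie package) or a probabilistic structure such as a count-min sketch.

    import Data.List (foldl')
    import qualified Data.Map.Strict as M
    import qualified Data.Text.Lazy as T

    -- Each node stores how many words end here plus a map of child branches.
    data Trie = Trie !Int (M.Map Char Trie)

    emptyTrie :: Trie
    emptyTrie = Trie 0 M.empty

    insertWord :: T.Text -> Trie -> Trie
    insertWord t (Trie n children) = case T.uncons t of
        Nothing        -> Trie (n + 1) children
        Just (c, rest) ->
            let step = Just . insertWord rest . maybe emptyTrie id
            in  Trie n (M.alter step c children)

    lookupCount :: T.Text -> Trie -> Int
    lookupCount t (Trie n children) = case T.uncons t of
        Nothing        -> n
        Just (c, rest) -> maybe 0 (lookupCount rest) (M.lookup c children)

    countAll :: [T.Text] -> Trie
    countAll = foldl' (flip insertWord) emptyTrie

    main :: IO ()
    main = do
        let trie = countAll (T.words (T.pack "the quick brown fox jumps over the lazy dog"))
        print (lookupCount (T.pack "the") trie)   -- prints 2

Shared prefixes are stored only once in a trie, which is where any memory savings over a Map of full Text keys would come from.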

answered by Mikhail Glushenkov