I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the counts of each word as I traverse the file.
The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file!
The code is shown below. Please note that I tried the following:

- Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better, in the sense that its memory consumption grows more slowly.
- Reading the file as a lazy ByteString instead of lazy Text, decoding it to Text for the processing, and encoding it back to ByteString for IO.
import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

-- Pair every word with a count of 1, then merge duplicates by summing.
dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main :: IO ()
main = do
    [file,out] <- getArgs
    h  <- openFile file ReadMode
    hO <- openFile out WriteMode
    mapM_ (flip hSetEncoding utf8) [h,hO]
    txt <- TI.hGetContents h
    -- Emit one "word<TAB>count" line per map entry.
    TI.hPutStr hO . T.unlines .
        map (uncurry ((. cons '\t' . pack . show) . append)) .
        toList . dictionate . T.words $ txt
    hFlush hO
    mapM_ hClose [h,hO]
    print "success"
What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?
This memory usage is expected. A Data.Map.Map consumes about 6N words of memory plus the size of the keys and values (figures taken from this excellent post by Johan Tibell). A lazy Text value of k characters takes up 7 words plus 2*k bytes (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes, and we will also assume that the average word in the input is 8 characters long.
Taking all of this into account, for N distinct words the memory usage comes to 6*N + 7*N + 2*N + 2*N = 17*N words (6 for the map node, 7 for the Text header, 2 for the 16-byte payload of an 8-character key, and 2 for the Word16), i.e. about 136 bytes per map entry.
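As a quick sanity check of the arithmetic, here is a small standalone program that reproduces the estimate; the 8-byte machine word, the 8-character average word length, and the 17-words-per-entry figure are exactly the assumptions stated above.

-- Back-of-the-envelope estimate of the worst-case Data.Map size,
-- under the assumptions from the text (64-bit machine, 8-character words).
wordSize, wordsPerEntry, bytesPerEntry :: Int
wordSize      = 8                         -- bytes per machine word
wordsPerEntry = 6 + 7 + 2 + 2             -- map node + Text header + Text payload + Word16
bytesPerEntry = wordsPerEntry * wordSize  -- 136 bytes per distinct word

-- Worst case: every 8-byte word in a 6 GiB file is unique.
worstCaseEntries :: Int
worstCaseEntries = (6 * 1024 ^ 3) `div` 8   -- roughly 800 million

worstCaseGiB :: Double
worstCaseGiB = fromIntegral (worstCaseEntries * bytesPerEntry) / 1024 ^ 3

main :: IO ()
main = print worstCaseGiB   -- prints 102.0

Running it prints 102.0, matching the worst-case figure derived below.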
In the worst case, all the words will be different, and there will be about (6 * 1024^3) / 8 ~= 800 * 10^6 of them. Plugging that into the formula above, we get a worst-case map size of approx. 102 GiB, which seems to agree with the experimental results. Solving the equation in the reverse direction tells us that your file contains about 200 * 10^6 different words.
As for alternative approaches to this problem, consider using a trie (as suggested by J. Abrahamson in the comments) or an approximate method, such as a count-min sketch.
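If you want to try the count-min sketch route, here is a minimal illustrative sketch of the idea; the table dimensions, the FNV-1a-style hash, and the helper names (newSketch, addWord, queryWord) are all invented for this example rather than taken from a library. The estimates can only over-count, and memory stays at a fixed width * depth table of counters no matter how many distinct words the file contains.

import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector.Unboxed.Mutable as VM
import Control.Monad (forM, forM_)
import Data.Bits (xor)
import Data.Word (Word64)

-- Fixed table: depth rows of width counters each (about 32 MiB here).
width, depth :: Int
width = 2 ^ 20   -- counters per row
depth = 4        -- rows, one per hash seed

-- A simple FNV-1a-style hash, seeded per row (illustrative, not tuned).
hashWith :: Int -> L.ByteString -> Int
hashWith seed s = fromIntegral (L.foldl' step basis s) `mod` width
  where
    basis = 0xcbf29ce484222325 + fromIntegral seed :: Word64
    step h c = (h `xor` fromIntegral (fromEnum c)) * 0x100000001b3

newSketch :: IO (VM.IOVector Int)
newSketch = VM.replicate (width * depth) 0

-- Bump one counter per row for the given word.
addWord :: VM.IOVector Int -> L.ByteString -> IO ()
addWord table w = forM_ [0 .. depth - 1] $ \row ->
  VM.modify table (+ 1) (row * width + hashWith row w)

-- The estimate is the minimum over the rows; it never under-counts.
queryWord :: VM.IOVector Int -> L.ByteString -> IO Int
queryWord table w = minimum <$> forM [0 .. depth - 1] (\row ->
  VM.read table (row * width + hashWith row w))

main :: IO ()
main = do
  table <- newSketch
  ws <- L.words <$> L.getContents          -- lazily stream words from stdin
  mapM_ (addWord table) ws
  queryWord table (L.pack "the") >>= print -- approximate count of "the"

Note the trade-off: a sketch answers "how often did word X occur?" within a small error, but it cannot enumerate the distinct words, so it fits point lookups rather than dumping the whole frequency table the way the Data.Map version does.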