Parsing large log files in Haskell

Tags: haskell

Suppose I have several 200 MB+ files that I want to grep through. How would I do this in Haskell?

Here's my initial program:

import Data.List
import Control.Monad
import System.IO
import System.Environment

main = do
  filename <- liftM head getArgs
  contents <- liftM lines $ readFile filename
  putStrLn . unlines . filter (isPrefixOf "import") $ contents

This reads the whole file into memory before parsing through it. Then I went with this:

import Data.List
import Control.Monad
import System.IO
import System.Environment

main = do
  filename <- liftM head getArgs
  file <- (openFile filename ReadMode)
  contents <- liftM lines $ hGetContents file
  putStrLn . unlines . filter (isPrefixOf "import") $ contents

I thought that since hGetContents is lazy, it would avoid reading the whole file into memory. But running both scripts under valgrind showed similar memory usage for both. So either my script is wrong, or valgrind is wrong. I compile the scripts using

ghc --make test.hs -prof

What am I missing? Bonus question: I see a lot of mentions on SO of how Lazy IO in Haskell is actually a bad thing. How / why would I use strict IO?

Update:

So it looks like I was wrong in my reading of valgrind. Using +RTS -s, here's what I get:

 7,807,461,968 bytes allocated in the heap
 1,563,351,416 bytes copied during GC
       101,888 bytes maximum residency (1150 sample(s))
        45,576 bytes maximum slop
             2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 13739 collections,     0 parallel,  2.91s,  2.95s elapsed
Generation 1:  1150 collections,     0 parallel,  0.18s,  0.18s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    2.07s  (  2.28s elapsed)
GC    time    3.09s  (  3.13s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.16s  (  5.41s elapsed)

The important line is "101,888 bytes maximum residency", which says that at any given point my script was using at most 101 KB of memory. The file I was grepping through was 44 MB. So I think the verdict is: readFile and hGetContents are both lazy.

Follow-up question:

Why do I see 7 GB of memory allocated on the heap? That seems really high for a script that's reading in a 44 MB file.

Update to follow-up question

Looks like a few GB of memory allocated on the heap is not atypical for Haskell, so there's no cause for concern. Using ByteStrings instead of Strings takes the memory usage down a lot:

  81,617,024 bytes allocated in the heap
      35,072 bytes copied during GC
      78,832 bytes maximum residency (1 sample(s))
      26,960 bytes maximum slop
           2 MB total memory in use (0 MB lost due to fragmentation)
Asked Mar 17 '12 by Vlad the Impala


2 Answers

Please don't use Strings (especially when processing 100 MB+ files). Just replace them with ByteStrings (or Data.Text):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
  filename <- liftM head getArgs
  contents <- liftM B.lines $ B.readFile filename
  B.putStrLn . B.unlines . filter (B.isPrefixOf "import") $ contents

And I bet this will be several times faster.
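
If the logs might contain non-ASCII text, a Data.Text version looks almost the same. Here's a minimal sketch (assuming the lazy Data.Text modules from the text package; note that T.readFile decodes with the locale encoding and will raise an exception on bytes that don't decode):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import System.Environment
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as T

main = do
  filename <- liftM head getArgs
  -- T.readFile is lazy, so the file is streamed in chunks as before
  contents <- liftM T.lines $ T.readFile filename
  T.putStrLn . T.unlines . filter (T.isPrefixOf "import") $ contents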

UPD: regarding your follow-up question.
The amount of allocated memory is strongly connected to the magic speedup you get when switching to ByteStrings.
Since String is just a generic list, it requires extra memory for each Char: a pointer to the next element, an object header, and so on. All of that memory has to be allocated and then collected back, which costs a lot of computational power.
On the other hand, a ByteString is a list of chunks, i.e. contiguous blocks of memory (I think not less than 64 bytes each). This greatly reduces the number of allocations and collections, and improves cache locality as well.
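
Roughly, the two representations look like this (a simplified sketch with illustrative names, not the exact library definitions):

import qualified Data.ByteString as S

-- A String is an ordinary linked list of boxed Chars: every character
-- costs a cons cell plus pointers, all of which the GC has to walk.
type PlainString = [Char]

-- A lazy ByteString is a chain of strict chunks, each one a contiguous,
-- unboxed block of bytes, so the per-character overhead largely disappears.
data LazyBytes
  = Empty
  | Chunk S.ByteString LazyBytes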

Answered by max taldykin


Both readFile and hGetContents should be lazy. Try running your program with +RTS -s and see how much memory is actually used. What makes you think the entire file is read into memory?
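
For example (a sketch; depending on your GHC version you may need to compile with -rtsopts before the program will accept RTS flags, and big.log is just a placeholder file name):

ghc --make -rtsopts test.hs
./test big.log +RTS -s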

As for the second part of your question, lazy IO is sometimes at the root of unexpected space leaks or resource leaks. It's not really the fault of lazy IO in and of itself, but determining whether it's leaky requires analyzing how it's used.
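
If you do want strict IO for a job like this, the simplest route is a strict ByteString: its readFile reads the whole file and closes the handle before returning, at the cost of holding the entire file in memory. A minimal sketch (one option among several, not the only way to do strict IO):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import System.Environment
import qualified Data.ByteString.Char8 as S

main = do
  filename <- liftM head getArgs
  -- S.readFile is strict: the file is fully read and the handle closed
  -- before execution continues, so nothing is held open lazily.
  contents <- liftM S.lines $ S.readFile filename
  S.putStrLn . S.unlines . filter (S.isPrefixOf "import") $ contents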

Answered by ephemient