Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell parse big xml file with low memory

So, I've played around with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the IO chapter in Real World Haskell (http://book.realworldhaskell.org/read/io.html) I was under the impression that if I run the following code, it will be garbage collected as I go through it.

However, when I run it on a big file, memory usage keeps climbing as it runs.

runghc parse.hs bigfile.xml

What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?

import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.UTF8 as U
import Prelude hiding (readFile)
import Text.XML.Expat.SAX 
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    contents <- BSL.readFile (head args)
    -- putStrLn $ U.toString contents
    let events = parse defaultParseOptions contents 
    mapM_ print $ map getTMSId $ filter isEvent events

isEvent :: SAXEvent String String -> Bool 
isEvent (StartElement "event" as) = True
isEvent _ = False

getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as

My end goal is to parse a huge xml file with a simple sax-like interface. I don't want to have to be aware of the whole structure to get notified that I've found an "event".

like image 661
Sean Clark Hess Avatar asked Nov 09 '11 15:11

Sean Clark Hess


1 Answers

I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.

The bug is new on ghc-7.2.1, and it's to do with an interaction that I didn't expect between a where clause binding to a triple, and unsafePerformIO, which I need to make the interaction with the C code appear pure in Haskell.

like image 186
blackh Avatar answered Sep 22 '22 01:09

blackh