So, I've played around with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the IO chapter in Real World Haskell (http://book.realworldhaskell.org/read/io.html) I was under the impression that if I run the following code, the file contents will be read lazily and garbage collected as I go through them.
However, when I run it on a big file, memory usage keeps climbing as it runs.
runghc parse.hs bigfile.xml
What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.UTF8 as U
import Prelude hiding (readFile)
import Text.XML.Expat.SAX
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    -- Lazy read: the file is pulled in chunk by chunk as the parser demands it
    contents <- BSL.readFile (head args)
    -- putStrLn $ U.toString contents
    let events = parse defaultParseOptions contents
    mapM_ print $ map getTMSId $ filter isEvent events

-- True only for the opening tag of an "event" element
isEvent :: SAXEvent String String -> Bool
isEvent (StartElement "event" _) = True
isEvent _ = False

-- Look up the TMSId attribute on a start tag
getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as
getTMSId _ = Nothing
My end goal is to parse a huge xml file with a simple sax-like interface. I don't want to have to be aware of the whole structure to get notified that I've found an "event".
I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.
The bug is new with ghc-7.2.1. It comes from an interaction I didn't expect between a where clause that binds a triple and unsafePerformIO, which I need in order to make the interaction with the C code appear pure in Haskell.
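On the question of whether map/filter force everything: they don't. Both are lazy, so the pipeline in the question is fine once the parser itself streams. Here is a minimal sketch of that point. It assumes a hypothetical stand-in Event type instead of hexpat's SAXEvent (so it runs without hexpat installed) and an invented infinite event stream. If map or filter evaluated the whole list, this program would never terminate.

```haskell
-- Hypothetical stand-in for hexpat's SAXEvent, reduced to what the
-- question uses. It is NOT the library's type.
data Event
  = StartElement String [(String, String)]
  | EndElement String
  deriving Show

-- An invented, infinite stream of events. Every tenth element is an
-- "event" carrying a TMSId attribute; the rest are filler.
events :: [Event]
events = concatMap mk [0 ..]
  where
    mk :: Int -> [Event]
    mk n
      | n `mod` 10 == 0 =
          [StartElement "event" [("TMSId", "id" ++ show n)], EndElement "event"]
      | otherwise =
          [StartElement "other" [], EndElement "other"]

isEvent :: Event -> Bool
isEvent (StartElement "event" _) = True
isEvent _                        = False

getTMSId :: Event -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as
getTMSId _                   = Nothing

-- Same pipeline shape as in the question, cut off with take so that it
-- only demands a finite prefix of the infinite list.
firstIds :: Int -> [Maybe String]
firstIds n = take n $ map getTMSId $ filter isEvent events

main :: IO ()
main = mapM_ print (firstIds 3)
```

Running it prints the first three ids and exits, which shows that only as much of the list is evaluated as the consumer asks for. Whether memory stays flat on a real file then depends on the producer (here, the parser) not holding on to what has already been consumed.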