I'm trying to write an application that analyzes data stored in fairly big XML files (from 10 to 800 MB). Each data set is stored as a single tag, with the concrete data specified as attributes. I'm currently using saxParse from HaXml, and I'm not satisfied with its memory usage: parsing a 15 MB XML file consumes more than 1 GB of memory, although I tried not to store the data in lists and to process it immediately. I use the following code:
    importOneFile file proc ioproc = do
        xml <- readFile file
        let (sxs, res) = saxParse file $ stripUnicodeBOM xml
        case res of
          Just str -> putStrLn $ "Error: " ++ str
          Nothing  -> forM_ sxs (ioproc . proc . extractAttrs "row")
Here `proc` is a function that converts the data from the attributes into a record, and `ioproc` is a function that performs some IO action on it: printing to the screen, storing in a database, etc.
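For illustration, a minimal sketch of what `proc` and `ioproc` might look like (the `Row` record and the attribute names are hypothetical, not taken from the original code):

```haskell
-- Hypothetical record type for one <row> element
data Row = Row { rowId :: String, rowValue :: String } deriving Show

-- 'proc': turn the attribute list of one element into a record
toRow :: [(String, String)] -> Row
toRow attrs = Row (get "id") (get "value")
  where get k = maybe "" id (lookup k attrs)

-- 'ioproc': perform an IO action per record, e.g. print it
printRow :: Row -> IO ()
printRow = print
```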
How can I decrease memory consumption during XML parsing? Would switching to another XML parser help?
Update: also, which parsers support different input encodings: UTF-8, UTF-16, UTF-32, etc.?
The Haskell XML Toolbox is based on the ideas of HaXml and HXML, but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions.
The package hxt forms the core of the toolbox. It contains a validating XML parser, an HTML parser that tries to read any text as HTML, a DSL for processing, transforming and generating XML/HTML, and so-called picklers for converting between XML and native Haskell data. HandsomeSoup adds CSS selectors to HXT.
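As a rough sketch of how the same task might look with HXT's arrow interface (combinators as exported by `Text.XML.HXT.Core`; the file name and the "row"/"id" names are placeholders):

```haskell
import Text.XML.HXT.Core

main :: IO ()
main = do
  -- Read the document without DTD validation, pick out every <row>
  -- element anywhere in the tree, and extract one attribute per row.
  ids <- runX $ readDocument [withValidate no] "data.xml"
                >>> deep (isElem >>> hasName "row")
                >>> getAttrValue "id"
  mapM_ putStrLn ids
```

Note that HXT builds a full document tree in memory, so on very large files it may not solve the memory problem by itself.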
If you're willing to assume that your inputs are valid, consider looking at TagSoup or Text.XML.Light from the Galois folks.
These take Strings as input, so you can (indirectly) feed them anything Data.Encoding understands.
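For example, a minimal TagSoup sketch (the element and attribute names are placeholders):

```haskell
import Text.HTML.TagSoup

main :: IO ()
main = do
  xml <- readFile "data.xml"
  -- parseTags produces the tag list lazily, so each <row>'s
  -- attributes can be consumed and discarded as the input is traversed.
  let rows = [attrs | TagOpen "row" attrs <- parseTags xml]
  mapM_ print rows
```

Because the tag list is produced lazily, processing each row immediately (as with `forM_` in the question) keeps memory proportional to one row rather than the whole file, provided nothing else retains a reference to the list.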