
Which XML parser for Haskell?

I'm trying to write an application that analyzes data stored in fairly large XML files (from 10 to 800 MB). Each data set is stored as a single tag, with the actual data given as attributes. I'm currently using saxParse from HaXml, and I'm not satisfied with its memory usage. Parsing a 15 MB XML file consumes more than 1 GB of memory, although I tried not to store the data in lists and to process it immediately. I use the following code:

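import Control.Monad (forM_)
import Text.XML.HaXml.SAX (saxParse)

-- stripUnicodeBOM and extractAttrs are my own helpers (not shown here).
importOneFile file proc ioproc = do
  xml <- readFile file
  let (sxs, res) = saxParse file $ stripUnicodeBOM xml
  case res of
    Just str -> putStrLn $ "Error: " ++ str
    Nothing  -> forM_ sxs (ioproc . proc . extractAttrs "row")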

where 'proc' is a procedure that converts the data from attributes into a record, and 'ioproc' is a procedure that performs some IO action: output to the screen, storing in a database, etc.

How can I decrease memory consumption during XML parsing? Would switching to another XML parser help?

Update: and which parsers support different input encodings (UTF-8, UTF-16, UTF-32, etc.)?

Alex Ott asked Jun 26 '09 09:06



1 Answer

If you're willing to assume that your inputs are valid, consider looking at TagSoup or Text.XML.Light from the Galois folks.
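As a rough illustration of the TagSoup route, here is a minimal sketch that pulls the attribute list out of every "row" element. It assumes the tagsoup package is installed; the names rowAttrs and processRows are made up for this example, and the "row" tag name is taken from the question. Because parseTags produces its tag list lazily and readFile reads lazily, this should let rows be handled one at a time, but I have not measured its memory use on large files.

```haskell
import Text.HTML.TagSoup (Tag (TagOpen), parseTags)

-- | Attribute lists of every <row ...> element, in document order.
-- (Hypothetical helper name; "row" comes from the question.)
rowAttrs :: String -> [[(String, String)]]
rowAttrs input = [attrs | TagOpen "row" attrs <- parseTags input]

-- | Read a file lazily and run the given action on each row's attributes.
-- (Hypothetical helper name; the action stands in for 'ioproc . proc'.)
processRows :: FilePath -> ([(String, String)] -> IO ()) -> IO ()
processRows path action = readFile path >>= mapM_ action . rowAttrs
```

For instance, processRows "data.xml" print would print each row's attribute list as it is encountered (the file name is only a placeholder).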

Both libraries take strings as input, so you can (indirectly) feed them anything Data.Encoding understands, namely:

  • ASCII
  • UTF8
  • UTF16
  • UTF32
  • KOI8R
  • KOI8U
  • ISO88591
  • GB18030
  • BootString
  • ISO88592
  • ISO88593
  • ISO88594
  • ISO88595
  • ISO88596
  • ISO88597
  • ISO88598
  • ISO88599
  • ISO885910
  • ISO885911
  • ISO885913
  • ISO885914
  • ISO885915
  • ISO885916
  • CP1250
  • CP1251
  • CP1252
  • CP1253
  • CP1254
  • CP1255
  • CP1256
  • CP1257
  • CP1258
  • MacOSRoman
  • JISX0201
  • JISX0208
  • ISO2022JP
  • JISX0212
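To make the "indirectly" part concrete: the idea is to decode the file into a String first and only then hand it to the parser. The sketch below does that with the text encodings built into GHC's System.IO rather than with Data.Encoding, so it is a substitute technique, not the Data.Encoding API itself. The name readFileWith is made up for this example. System.IO exports utf8, utf16, utf32 and latin1 directly, and mkTextEncoding looks up other encodings by name, subject to what the platform supports.

```haskell
import System.IO

-- | Read a file as a String, decoding its bytes with the given encoding.
-- (Hypothetical helper name, not part of any library.)
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  -- Lazy read: the handle is closed once the contents are fully consumed.
  hGetContents h
```

The resulting String can then be passed to whichever string-based parser you choose, for example readFileWith utf16 "data.xml" followed by the parse function (the file name is only a placeholder).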
Greg Bacon answered Nov 03 '22 09:11
