Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse a large XML file in Haskell with limited amount of resources?

I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used SAX parsing functions from Hexpath.

Here is a simple code I tested:

import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax

parse :: FilePath -> IO ()
parse path = do
    inputText <- L.readFile path
    let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
    let txt = foldl' processEvent "" saxEvents
    putStrLn txt

After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.

If processEvent becomes complex enough, the program crashes with a stack space overflow error.

What am I doing wrong?

like image 699
kolam Avatar asked Oct 31 '22 07:10

kolam


1 Answers

You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!

In fact, hexpat does have 'streaming' interface (just like xml-conduit). It uses the not-too-well known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT which I then used to export ordinary Pipes.Producer functions like parseProduce. The trivial manipulations needed for this are appended below as PipesSax.hs

Once we have parseProducer we can convert a ByteString or Text Producer into a Producer of SaxEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 mb of memory, to judge from looking at top.

-- Sax.hs Most of the IO actions use a registerIds pipe defined at the bottom which is tailored to a giant bit of xml of which this is a valid 1000 fragment http://sprunge.us/WaQK

{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX 
import Pipes  -- cabal install pipes pipes-bytestring 
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m () 
                 -> Producer (SAXEvent ByteString ByteString) m ()
sax =  parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 =  runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =  
 IO.withFile "input.xml" IO.ReadMode $ \inp -> 
 IO.withFile "output.txt" IO.WriteMode $ \out -> 
   runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out 

-- folds:
-- print number of IDs
main3 =  IO.withFile "input.xml" IO.ReadMode $ \inp -> 
           do n <- P.length $ sax (fromHandle inp) >-> registryIds
              print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 =  IO.withFile "input.xml" IO.ReadMode $ \inp ->
         do let pipeline =  sax (fromHandle inp) >-> registryIds >-> P.map readIntId
            n <- P.fold (+) 0 id pipeline
            print n
  where
   readIntId :: ByteString -> Integer
   readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")] 
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do 
  e <- await  -- we look for a 'SAXEvent'
  case e of -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do yield a
                                                         yield "\n"
                                                         registryIds
    _ -> registryIds  

-- 'library': PipesSax.hs

This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.

{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where 
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes  
import qualified Pipes.Internal as I

parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text) 
  => ParseOptions tag text
  -> Producer ByteString m () 
  -> Producer (SAXEvent tag text) m ()
parseProducer opt  = enumerate . enumerate_ 
                     . parseG opt 
                     . Select_ . Select

parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text) 
  => ParseOptions tag text
  -> Producer ByteString m () 
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt = 
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select  

newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
    deriving (Functor, Monad, MonadPlus, MonadIO
             , Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
 type ItemM (ListT_ m) = m
 joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_) 
 runList   = liftM emend  . next  . enumerate . enumerate_
   where 
     emend (Right (a,q)) = Cons a (Select_ (Select q))
     emend _ = Nil
like image 120
Michael Avatar answered Nov 08 '22 07:11

Michael