
Optimising a Haskell XML parser

I'm experimenting with Haskell at the moment and am very much enjoying the experience, but am evaluating it for a real project with some fairly stringent performance requirements. The first pass of my task is to process a complete (no-history) dump of Wikipedia (bzipped), totalling about 6 GB compressed. In Python, a script that does a full extract of each raw page (about 10 million in total) takes about 30 minutes on my box (and for reference, a Scala implementation using the pull parser takes about 40 minutes). I've been attempting to replicate this performance using Haskell and GHC, but have been struggling to match it.

I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I'm using lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. To extract the text for each page I'm building up a DList of pointers to text elements and then iterating over this to dump it out to stdout.

The code I've just described has already been through a number of profiling/refactor iterations (I quickly moved from Strings to ByteStrings, then from string concatenation to lists of pointers to text, then to DLists of pointers to text). I think I've got about two orders of magnitude of speedup over the original code, but it still takes over an hour and a half to parse (although it has a lovely small memory footprint). So I'm looking for a bit of inspiration from the community to get me the extra mile. The code is below (and I've broken it up into a number of subfunctions in order to get more detail from the profiler). Please excuse my Haskell - I've only been coding for a couple of days (having spent a week with Real World Haskell). And thanks in advance!

import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip

import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format

testFile = "../data/enwiki-latest-pages-articles.xml.bz2"

validPage pageData = case pageData of
    (Just _, Just _) -> True
    (_, _) -> False

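-- scanChildren/getContent walk a subtree and collect every text node into a
-- difference list, so the repeated appends stay O(1).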
scanChildren :: [UNode ByteString] -> DList ByteString
scanChildren c = case c of
    h:t -> DList.append (getContent h) (scanChildren t)
    []  -> DList.fromList []

getContent :: UNode ByteString -> DList ByteString
getContent treeElement =
    case treeElement of
        (Element name attributes children)  -> scanChildren children
        (Text text)                         -> DList.fromList [text]

rawData t = ((getContent.fromJust.fst) t, (getContent.fromJust.snd) t)

extractText page = do
    revision <- findChild (BS.pack "revision") page
    text <- findChild (BS.pack "text") revision
    return text

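-- Collect (title, text) content for every <page> element, skipping pages
-- that are missing either child.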
pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
    where
        relevantChildren node = case node of
            (Element name attributes children) -> name == (BS.pack "page")
            (Text _) -> False

outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages

readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)
parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)

main = do
    rawContent <- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess
Asked Apr 19 '11 by Alex Wilson


1 Answer

After running your program I get somewhat weird results:

./wikiparse  +RTS -s -A5m -H5m | tail
./wikiparse +RTS -s -A5m -H5m
3,604,204,828,592 bytes allocated in the heap
  70,746,561,168 bytes copied during GC
      39,505,112 bytes maximum residency (37822 sample(s))
       2,564,716 bytes maximum slop
              83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections,     0 parallel, 15.84s, 368.69s elapsed
  Generation 1: 37822 collections,     0 parallel,  1.08s, 33.08s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  243.85s  (4003.81s elapsed)
  GC    time   16.92s  (401.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  260.77s  (4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed

The total (user) time is more than OK, I think: 260 s is way faster than 30 minutes for Python. I have no idea, though, why the elapsed time is so large here. I really don't think that reading a 6 GB file should take more than an hour to complete.

I'm running your program again to check if the results are consistent.

If that figure of 4'20'' (the ~260 s of CPU time) is right, then I believe something is wrong with the machine... or there is some other strange effect here.

The code was compiled on GHC 7.0.2.


Edit: I tried various versions of the program above. The most important optimization seems to be the {-# INLINE #-} pragma and specialization of functions. Some functions have pretty generic types, which is known to be bad for performance. OTOH I believe inlining should be enough to trigger the specialization, so you should try experimenting further with this.
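To illustrate what I mean (a minimal sketch, not code from either program - the function here is made up): a function with a generic, class-constrained type pays for a dictionary at every call unless GHC specializes it, and the INLINE and SPECIALIZE pragmas make that explicit:

module Main where

import Data.List (foldl')

-- A generic, class-constrained type: every call goes through a Num
-- dictionary unless GHC specializes it for the concrete element type.
{-# INLINE total #-}
{-# SPECIALIZE total :: [Int] -> Int #-}
total :: Num a => [a] -> a
total = foldl' (+) 0

main :: IO ()
main = print (total [1 .. 1000000 :: Int])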

I didn't see any significant difference across the versions of GHC I tried (6.12 .. HEAD).

The Haskell bindings to bzlib seem to have optimal performance. The following program, which is a near-complete reimplementation of the standard bzcat program, is as fast as or even faster than the original.

module Main where

import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)

readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files                 

On my machine it takes ~1100 s to decompress the test file to /dev/null. The fastest version I was able to get was based on a SAX-style parser. I'm not sure though whether its output matches that of the original. On small outputs the result is the same, and so is the performance. On the original file the SAX version is somewhat faster and completes in ~2400 s. You can find it below.

{-# LANGUAGE OverloadedStrings #-}

import System.Exit
import Data.Maybe

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip

import System.IO

import Text.XML.Expat.SAX as SAX

type ByteStringL = BSL.ByteString
type Token = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]
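-- Each TokenParser below is one state of a hand-written state machine over
-- the SAX event stream: find the next <page>, walk to its <title>, <revision>
-- and <text>, and emit the character data found inside each <text> element.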

testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"


readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs) = pageStart xs
pageStart [] = []

{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs) = pageStart xs
titleStart (_:xs) = titleStart xs
titleStart [] = error "could not find <title>"


{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs) = pageStart xs
revisionStart (_:xs) = revisionStart xs
revisionStart [] = error "could not find <revision>"

{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs) = pageStart xs
textStart (_:xs) = textStart xs
textStart [] = error "could not find <text>"

{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs = (reverse acc) : textEnd xs

{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart

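-- finish: skip events until the matching end tag, then hand over to the
-- continuation parser.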
{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ [] = error (show (tag,("finish []" :: String)))

main :: IO ()
main = do
  rawContent <- readCompressed testFile
  let parsed = (pageStart (SAX.parse defaultParseOptions rawContent))
  mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- uncomment 'take 5000' to finish early
  putStrLn "Complete!"

Generally, I'm suspicious that the Python and Scala versions are finishing early. I couldn't verify that claim without the source code, though.

To sum up: inlining and specialization should give a reasonable, roughly two-fold increase in performance.

Answered by Tener