I'm experimenting with Haskell at the moment and am very much enjoying the experience, but I'm evaluating it for a real project with some fairly stringent performance requirements. The first pass of my task is to process a complete (no-history) dump of Wikipedia (bzipped), totalling about 6GB compressed. In Python, a script to do a full extract of each raw page (about 10 million in total) takes about 30 minutes on my box (and for reference a Scala implementation using the pull parser takes about 40 mins). I've been attempting to replicate this performance using Haskell and GHC, and have been struggling to match it.
I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I'm using lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. To extract the text for each page I'm building up a DList of pointers to the text elements and then iterating over this to dump it out to stdout. The code I've just described has already been through a number of profiling/refactor iterations (I quickly moved from Strings to ByteStrings, then from string concatenation to lists of pointers to text, then to DLists of pointers to text). I think I've got about two orders of magnitude of speedup over the original code, but it still takes over an hour and a half to parse (although it has a lovely small memory footprint). So I'm looking for a bit of inspiration from the community to get me the extra mile. The code is below (and I've broken it up into a number of subfunctions in order to get more detail from the profiler). Please excuse my Haskell - I've only been coding for a couple of days (having spent a week with Real World Haskell). And thanks in advance!
import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip
import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format
testFile = "../data/enwiki-latest-pages-articles.xml.bz2"
validPage pageData = case pageData of
    (Just _, Just _) -> True
    (_, _) -> False
scanChildren :: [UNode ByteString] -> DList ByteString
scanChildren c = case c of
    h:t -> DList.append (getContent h) (scanChildren t)
    []  -> DList.fromList []
getContent :: UNode ByteString -> DList ByteString
getContent treeElement =
    case treeElement of
        (Element name attributes children) -> scanChildren children
        (Text text) -> DList.fromList [text]
rawData t = ((getContent.fromJust.fst) t, (getContent.fromJust.snd) t)
extractText page = do
    revision <- findChild (BS.pack "revision") page
    text <- findChild (BS.pack "text") revision
    return text
pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
  where
    relevantChildren node = case node of
        (Element name attributes children) -> name == (BS.pack "page")
        (Text _) -> False
outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages
readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)
parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)
main = do
    rawContent <- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess
After running your program I get somewhat weird results:
./wikiparse +RTS -s -A5m -H5m | tail

./wikiparse +RTS -s -A5m -H5m
   3,604,204,828,592 bytes allocated in the heap
      70,746,561,168 bytes copied during GC
          39,505,112 bytes maximum residency (37822 sample(s))
           2,564,716 bytes maximum slop
                  83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections,     0 parallel, 15.84s, 368.69s elapsed
  Generation 1:  37822 collections,     0 parallel,  1.08s,  33.08s elapsed

  INIT  time    0.00s  (    0.00s elapsed)
  MUT   time  243.85s  ( 4003.81s elapsed)
  GC    time   16.92s  (  401.77s elapsed)
  EXIT  time    0.00s  (    0.00s elapsed)
  Total time  260.77s  ( 4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed
Total time is more than OK I think: 260s of CPU time is way faster than 30 minutes for Python. I have no idea, though, why the wall-clock (elapsed) time is so big here. I really don't think that reading a 6GB file would take more than an hour to complete.
I'm running your program again to check if the results are consistent.
If that figure of 4'20'' of CPU time is right, then I believe something is wrong with the machine... or there is some other strange effect here.
The code was compiled on GHC 7.0.2.
Edit: I tried various versions of the program above. The most important optimization seems to be the {-# INLINE #-} pragma and specialization of functions. Some of the functions have pretty generic types, which is known to be bad for performance. OTOH, I believe inlining should be enough to trigger the specialization, so you should experiment further with this.
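To make that concrete, here is a minimal sketch of the idea (the function and its name are invented for illustration, not taken from your code): an overloaded function is compiled once with dictionary passing, a SPECIALIZE pragma asks GHC for a dedicated monomorphic copy, and INLINE lets GHC specialize it at each call site instead.

-- Hypothetical example: 'countOver' is overloaded, so by default GHC
-- compiles it once and passes an Ord dictionary on every call.
countOver :: Ord a => a -> [a] -> Int
countOver x = length . filter (> x)

-- Ask GHC for a dedicated, dictionary-free copy at the type actually used ...
{-# SPECIALIZE countOver :: ByteString -> [ByteString] -> Int #-}
-- ... or mark it INLINE so it can be specialized at each call site.
{-# INLINE countOver #-}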
I didn't see any significant difference across the versions of GHC I tried (6.12 .. HEAD).
The Haskell bindings to bzlib seem to have optimal performance. The following program, which is a near-complete reimplementation of the standard bzcat program, is as fast as or even faster than the original.
module Main where
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)
main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files
On my machine it takes ~1100s to decompress the test file to /dev/null.
The fastest version I was able to get was based on a SAX-style parser. I'm not sure, though, whether the output matches that of the original. On small inputs the result is the same, and so is the performance. On the original file the SAX version is somewhat faster and completes in ~2400s. You can find it below.
{-# LANGUAGE OverloadedStrings #-}
import System.Exit
import Data.Maybe
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.IO
import Text.XML.Expat.SAX as SAX
type ByteStringL = BSL.ByteString
type Token = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]
testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"
readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)
{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs) = pageStart xs
pageStart [] = []
{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs) = pageStart xs
titleStart (_:xs) = titleStart xs
titleStart [] = error "could not find <title>"
{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs) = pageStart xs
revisionStart (_:xs) = revisionStart xs
revisionStart [] = error "could not find <revision>"
{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs) = pageStart xs
textStart (_:xs) = textStart xs
textStart [] = error "could not find <text>"
{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs = (reverse acc) : textEnd xs
{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart
{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ [] = error (show (tag,("finish []" :: String)))
main :: IO ()
main = do
    rawContent <- readCompressed testFile
    let parsed = pageStart (SAX.parse defaultParseOptions rawContent)
    mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- remove comment to finish early
    putStrLn "Complete!"
Generally, I'm suspicious that the Python and Scala versions are finishing early. I couldn't verify that claim, though, without the source code.
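One cheap sanity check, if you can do the same on the Python/Scala side, is to count extracted pages rather than dump their text, and compare the counts. A sketch reusing the SAX code above (countPages is just a name I made up, and I haven't benchmarked this):

-- Hypothetical check: each element of 'parsed' corresponds to one <text>
-- node that was extracted, so its length is the number of pages processed.
countPages :: IO ()
countPages = do
    rawContent <- readCompressed testFile
    let parsed = pageStart (SAX.parse defaultParseOptions rawContent)
    print (length parsed)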
To sum up: inlining and specialization should give a reasonable, roughly two-fold increase in performance.