
Can parser combinators be made efficient?

Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec against a simple hand-rolled precedence-climbing parser written in F#, and was surprised to find the F# parser to be 25× faster than the Haskell one.

Here's the Haskell code I used to read a large mathematical expression from file, parse and evaluate it:

import Control.Applicative
import Text.Parsec hiding ((<|>))

expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')

term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')

fact = read <$> many1 digit <|> char '(' *> expr <* char ')'

eval :: String -> Int
eval = either (error . show) id . parse expr "" . filter (/= ' ')

main :: IO ()
main = do
    file <- readFile "expr"
    putStr $ show $ eval file
    putStr "\n"

and here's my self-contained precedence climbing parser in F#:

let rec (|Expr|) = function
  | P(f, xs) -> Expr(loop (' ', f, xs))
  | xs -> invalidArg "Expr" (sprintf "%A" xs)
and loop = function
  | ' ' as oop, f, ('+' | '-' as op)::P(g, xs)
  | (' ' | '+' | '-' as oop), f, ('*' | '/' as op)::P(g, xs) ->
      let h, xs = loop (op, g, xs)
      match op with
      | '+' -> (+) | '-' -> (-) | '*' -> (*) | '/' | _ -> (/)
      |> fun op -> loop (oop, op f h, xs)
  | _, f, xs -> f, xs
and (|P|_|) = function
  | '('::Expr(f, ')'::xs) -> Some(P(f, xs))
  | c::_ as xs when '0' <= c && c <= '9' ->
      let rec loop n = function
        | c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
        | xs -> Some(P(n, xs))
      loop 0 xs
  | _ -> None

My impression is that even state-of-the-art parser combinators waste a lot of time backtracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance, or is it necessary to use code generation?
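To make the backtracking concern concrete, here is a toy sketch of my own (the names and structure are illustrative only, not the internals of Parsec or any other real library): a Maybe-based parser whose choice operator re-runs the second alternative from the original input whenever the first one fails, so alternatives that share a long prefix end up scanning it repeatedly.

-- Toy Maybe-based parser; sketch only, not any real library's implementation.
newtype P a = P { runP :: String -> Maybe (a, String) }

charP :: Char -> P Char
charP c = P $ \s -> case s of
  (x:xs) | x == c -> Just (x, xs)
  _               -> Nothing

stringP :: String -> P String
stringP []     = P $ \s -> Just ([], s)
stringP (c:cs) = P $ \s -> do
  (x,  s')  <- runP (charP c) s
  (xs, s'') <- runP (stringP cs) s'
  Just (x:xs, s'')

orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \s -> case p s of
  Nothing -> q s   -- failure: re-run the second branch from the original input s
  ok      -> ok

-- Both alternatives share the prefix "for"; when the first fails on its last
-- character, all the input consumed so far is thrown away and scanned again.
demo :: Maybe (String, String)
demo = runP (stringP "foreach" `orElse` stringP "forever") "forever 1"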

EDIT:

Here's the OCaml script I used to generate a ~2 MB expression for benchmarking:

open Printf

let rec f ff n =
  if n=0 then fprintf ff "1" else
    fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)

let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 3 in
  fprintf stdout "%a\n" f n
Asked by J D on Dec 30 '10


1 Answer

I've come up with a Haskell solution that is 30× faster than the one you posted (with my concocted test expression).

Major changes:

  1. Changed Parsec/String to Attoparsec/ByteString.
  2. In the fact function, changed read <$> many1 digit to decimal.
  3. Made the chainl1 recursion strict (remove the $! for the lazier version); a sketch of the lazy and strict variants follows this list.
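For point 3, the only difference is whether the running result is forced before the recursive call. A minimal sketch of the two variants side by side (the names here are mine and are not part of the final code below):

import Control.Applicative

-- Sketch only: the lazy variant carries (f x y) along as an unevaluated thunk
-- across the whole input; the strict one forces the accumulator at each step.
chainl1Lazy, chainl1Strict :: (Monad f, Alternative f)
                           => f a -> f (a -> a -> a) -> f a
chainl1Lazy p op = p >>= rest where
  rest x = (do f <- op
               y <- p
               rest (f x y))      -- builds a chain of pending applications
           <|> pure x

chainl1Strict p op = p >>= rest where
  rest x = (do f <- op
               y <- p
               rest $! f x y)     -- ($!) keeps the accumulator evaluated
           <|> pure x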

I tried to keep everything else as close to what you had as possible.

import Control.Applicative
import Data.Attoparsec
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B

expr :: Parser Int
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')

term :: Parser Int
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')

fact :: Parser Int
fact = decimal <|> char '(' *> expr <* char ')'

eval :: B.ByteString -> Int
eval = either (error . show) id . eitherResult . parse expr . B.filter (/= ' ')

chainl1 :: (Monad f, Alternative f) => f a -> f (a -> a -> a) -> f a
chainl1 p op = p >>= rest where
  rest x = do f <- op
              y <- p
              rest $! (f x y)
           <|> pure x

main :: IO ()
main = B.readFile "expr" >>= (print . eval)

What I conclude from this is that most of the slowdown came from the combinator library sitting on an inefficient base (String parsing with Parsec), not from it being a parser combinator per se.

I imagine that with more time and profiling this could be made to go faster; I stopped once it went past the 25× mark.

I don't know whether this would be faster than your precedence-climbing parser ported to Haskell; maybe that would be an interesting test?
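For reference, here is a rough, unbenchmarked sketch of what such a port might look like: a plain hand-rolled recursive-descent parser over String with the same expr/term/fact split (so not a literal translation of your precedence-climbing encoding, and it would probably want ByteString and some strictness before being a fair comparison):

-- Hand-rolled recursive descent over String; sketch only, not tuned.
expr :: String -> (Int, String)
expr s = goE (term s) where
  goE (x, '+':rest) = let (y, rest') = term rest in goE (x + y, rest')
  goE (x, '-':rest) = let (y, rest') = term rest in goE (x - y, rest')
  goE r             = r

term :: String -> (Int, String)
term s = goT (fact s) where
  goT (x, '*':rest) = let (y, rest') = fact rest in goT (x * y, rest')
  goT (x, '/':rest) = let (y, rest') = fact rest in goT (x `div` y, rest')
  goT r             = r

fact :: String -> (Int, String)
fact ('(':rest) = case expr rest of
  (x, ')':rest') -> (x, rest')
  _              -> error "expected ')'"
fact s = digits 0 s where
  digits n (c:cs) | '0' <= c && c <= '9' = digits (10*n + fromEnum c - fromEnum '0') cs
  digits n cs     = (n, cs)

main :: IO ()
main = do
  file <- readFile "expr"
  print (fst (expr (filter (/= ' ') file)))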

Answered by Darius Jahandarie on Sep 16 '22