Question
I know Parsec
and uu-parsinglib
and I've written parsers in both of them. Recently I discovered, that there is a problem in uu-parsinglib
, which could significantly affect its performance and I do not see a way to solve it.
Lets consider following Parsec parsers:
pa = char 'a'
pb = char 'b'
pTest = many $ try(pa <* pb)
What would be the equivalent in uu-parsinglib
? It would not be the following:
pa = pSym 'a'
pb = pSym 'b'
pTest = pList_ng (pa <* pb)
The difference is, that in Parsec
, many
would eat (pa <* pb)
(pairs of "ab"
) greedy until it is not matched anymore, while in uu-parsinglib
, pList_ng
is not greedy, so it will keep in memory possible backtrack ways after parsing each (pa <* pb)
.
Is there any way to write something like pList(try(pa <* pb))
in uu-parsinglib
?
Example
A good example would be
pExample = pTest <* (pa <* pb)
and a sample input of "ababab"
.
With Parsec
, we would get error (because pTest
is greedy parsing pairs of "ab"
), but in uu-parsinglib
, the string would be parsed with no problems.
Edit
We cannot switch from pList_ng
to pList
, because it would be not equivalent to Parsec
version. For example:
pExample2 = pTest <* pa
and a sample input of "ababa"
would success in Parsec
, but fail in uu-parsinglib
, when using greedy pList
.
Of course uu-parsinglib
will succeed if we use pList_ng
here, but it could be a lot slower for some inputs and rules. For example, considering the input "ababab"
, Parsec
would simply fail, because pTest
would consume whole string and pa
would not be matched. uu-parsinglib
will fail also, but checking a more steps - it will match whole string and fail, then throw away last "ab"
pair and fail again, etc. If we have some nested rules and funny input text, it will make a huge difference.
A little benchmark
To prove, that the problem exists in real, consider following grammar (in a pseudocode - but I think it is very intuitive):
pattern = many("ab") + "a"
input = many(pattern)
So as input to our program we get a string containing patterns, for example "abababaaba" contains 2 patterns "abababa" and "aba".
Lets make parsers in both libraries!
uu-parsinglib
:
import Data.Char
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
import System.TimeIt (timeIt)
pa = pSym 'a'
pb = pSym 'b'
pTest = pList $ pList_ng ((\x y -> [x,y]) <$> pa <*> pb) <* pa
main :: IO ()
main = do
timeIt maininner
return ()
maininner = do
let (x,e) = UU.parse ((,) <$> pTest <*> pEnd) (createStr (LineColPos 0 0 0) (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a")))
print $ length x
Parsec
:
import Control.Applicative
import Text.Parsec hiding (many, optional, parse, (<|>))
import qualified Text.Parsec as Parsec
import System.TimeIt (timeIt)
pa = char 'a'
pb = char 'b'
pTest = many $ many (try ((\x y -> [x,y]) <$> pa <*> pb)) <* pa
main :: IO ()
main = do
timeIt maininner2
return ()
maininner2 = do
let Right x = Parsec.runParser pTest (0::Int) "test" $ (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a"))
print $ length x
Result? uu-parsinglib
is > 300% slower:
uu-parsinglib - 3.19s
Parsec - 1.04s
(compiled with ghc -O3
flag)
To understand the subtleties it is important to understand the differences between the try construct in Haskell and the non-greedy parsing strategy used in uu-parsinglib. Effectively the latter is a try which just looks ahead one symbol. In that respect it is less powerful than the try construct from Parsec, in which you specify that a specific construct has to be present completely. And then there is the underlying different overall strategy. Parsec uses a back-tracking strategy with explicit tries to commit, whereas uu-parsinglib uses a breadth-first strategy with an occasional single symbol look-ahead.
So it does not come as a surprise that there is a time difference between the two. In the Parsec case it is decided that the tried alternative does apply after having seen the complete construct (two symbols), whereas the greedy uu-parsinglib decides that this must be the right alternative after having successfully seen the first symbol. And this conclusion may be unjustified.
If one moves to the breadth-first strategy uu-parsinglib uses one has to keep track of several alternatives at the same time, and this take time. Two alternative, twice the time, etc.
The advantage of Parsec is that you can tune a back-tracking parser by liberal use of try constructs and by placing alternatives in the right order, but you are also more likely to ask questions on mailing list about why your parsers do not work as expected. You are not so much writing a grammar as writing parser. The uu-parsinglib starts from the other end of the spectrum: we try to accept quite a large collection of grammars (and the parsers implied by them).
My feeling is also that in the presence of try constructs having excellent error-repairing parsers is much more complicated. Once a try construct fails it is not possible to go back there and decide that, with a small correction, it is much better than the alternatives that come after it.
The behavior you're describing (using pList_ng
) does apply to other parsers (such as the simple list-of-successes method described in, for example, Jeroen Fokker's Functional Parsers) combinators), but not to uu-parsinglib. The library uses a breadth-first strategy to avoid space leaks (as a result of hanging on to the entire input, as would be the case when using a depth-first strategy). That is why I asked whether you had either created a test case or looked at the internals at all.
For a more technical description, see the paper in Text.ParserCombinators.UU.README (and maybe the source code after that). Here I'll use pExample2
to sketch the parsing process. Branching happens in pList_ng
(in pTest
), which uses <|>
to recognize either the empty string or another element. Because pTest
is followed by pa
, instead of the empty string, the alternative to parsing another element is actually recognizing a single 'a'
.
When we see the first 'a'
in the input, both alternatives can successfully parse this character. Next, we see a 'b'
. Now the alternative that parses only a single 'a'
cannot make any further progress, so we drop that one. There is one alternative left: the one that recognizes (a list of) 'a'
followed by 'b'
(pTest
). Next up is another 'a'
, and there are again two alternatives to consider. But then we see a 'b'
and, again, we can immediately drop the second alternative. Then there is one last characer, an 'a'
, which once more means two alternatives. But now we get to the end of the string and only the alternative obtained by letting pa
recognize the final a
leads to a succesful parse.
Considering the alternative input "ababab"
, we see that the pa
alternative fails again when we get to the final 'b'
, so only the pTest
alternative remains. That one finishes because we get to the end and then pa
(following pTest
in pExample2
) fails, so we get an error.
At any point, uu-parsinglib only needs to keep alternatives in memory that have not yet failed, and the breadth-first strategy ensures that all alternatives are evaluated in lockstep, so there is no need to hang on to the first 'a'
and 'b'
until the end of the string is reached and it does not match the whole string first before backtracking.
Edit:
From what I gather about Parsec, this is indeed less efficient, because Parsec does not consider pa
until pTest
finishes. Section 5.1 in the paper says a few words about something like try
for uu-parsinglib, including some objections. Raw speed is not a primary goal and I once saw a presentation of some benchmarks where uu-parsinglib also did not come out on top, but overall it performed reasonably well. If speed is so important for this compiler you mentioned and if you don't need any extra features such as online results or error correction, maybe you should just stick to Parsec? (Or look for more comprehensive benchmarks first.)
There are clearly significant differences between the two libraries, so I'm not surprised to see that Parsec is faster in some cases, although the difference in this case is indeed pretty big. Maybe there is a way to tweak the uu-parsinglib version further (without changing internals as hinted at by that section about greedy parsing in the paper), but it's not obvious (to me anyway).
Well, one thing you could do is rewrite the grammar:
pTest' = pList $ pa *> pList ((\x y -> [x,y]) <$$> pb <*> pa)
For Parsec that would the following I think (but this seems to make it slower):
pTest' = many $ pa *> many (flip (\x y -> [x,y]) <$> pb <*> pa)
This helps, but not enough to beat the Parsec version. Using your benchmark, I get the following results:
uu-parsinglib (pTest) - 1.98s
uu-parsinglib (pTest') - 1.11s
Parsec (pTest) - 0.59s
Parsec (pTest') - 0.67s
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With