I'm trying to create a lint tool for the C/AL programming language, so I basically need to perform lexical and syntax analysis on the source code. I originally planned to write a parser from scratch, but then discovered that there are lots of tools that can generate these parsers automatically.
I need performance, since checking 20 megabytes of code in one pass is a normal scenario, and I need the tool to be extensible with custom rules. So I decided to go with JavaScript.
So far I have found two generators I could use: Jison and PEG.js.
Which of them gives me better parsing performance? Or, rather than comparing the libraries, which of the underlying algorithms is faster?
Which one is better suited to my needs (parsing a general-purpose programming language)?
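For a sense of what each looks like, here is a trivial expression rule in both notations (sketches adapted from each project's calculator example; a real C/AL grammar would of course be far larger). First Jison, which uses a Bison-style grammar with a separate lexer section:

```
%lex
%%
\s+       /* skip whitespace */
[0-9]+    return 'NUMBER'
"+"       return '+'
<<EOF>>   return 'EOF'
/lex

%left '+'
%start expressions
%%
expressions
    : e EOF     { return $1; }
    ;
e
    : e '+' e   { $$ = $1 + $3; }
    | NUMBER    { $$ = Number(yytext); }
    ;
```

And the same thing in PEG.js, where the ordered-choice rules double as the lexer:

```
start
  = additive

additive
  = left:number "+" right:additive { return left + right; }
  / number

number
  = digits:[0-9]+ { return parseInt(digits.join(""), 10); }
```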
UPDATE: I have found similar Q&As:
In general you'd get very good parsing performance from a shift-reduce parser such as the one Jison implements. It's a bit old-school, perhaps, but it can work within very tight memory constraints and in linear time.
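For concreteness, driving a Jison-generated parser from Node looks roughly like this (a sketch; the file name calc.jison is hypothetical):

```
var fs = require("fs");
var jison = require("jison");

// Build a parser from a grammar at runtime; Jison can also emit
// a standalone parser module ahead of time.
var grammar = fs.readFileSync("calc.jison", "utf8");
var parser = new jison.Parser(grammar);

// parse() makes a single left-to-right pass over the input, keeping
// only the LALR tables and a small state stack in memory.
var result = parser.parse("2 + 3 * 4");
console.log(result);
```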
PEG produces a different kind of parser that is perhaps more capable, but it requires more memory to produce the same result. That is, a PEG parser will require an amount of memory proportional to the input, while an LALR parser will do the job in much less space (some fixed tables and a small stack).
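Here is a minimal sketch of packrat memoization, the technique that gives PEG parsers their linear-time guarantee at the cost of a memo table that grows with the input. The names are illustrative, not from any particular library:

```
function packratParser(input) {
  var memo = new Map(); // "rule:pos" -> { value, next } or null

  function memoize(name, fn) {
    return function (pos) {
      var key = name + ":" + pos;
      if (memo.has(key)) return memo.get(key); // never re-parse this span
      var result = fn(pos);
      memo.set(key, result); // one entry per rule per input offset
      return result;
    };
  }

  // digit <- [0-9]
  var digit = memoize("digit", function (pos) {
    return /[0-9]/.test(input.charAt(pos))
      ? { value: input.charAt(pos), next: pos + 1 }
      : null;
  });

  return { digit: digit, memoSize: function () { return memo.size; } };
}

var p = packratParser("42");
p.digit(0);                // parsed and cached
p.digit(0);                // served from the memo table, no re-parse
console.log(p.memoSize()); // grows with rules x input positions
```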
"I need performance (for 20Mb) ... so I decided Javascript". These are contradictory.
Carefully coded recursive descent parsers can be pretty fast, but you have to write a lot of code. Generally, LALR(1) parsers (generated by Bison from a grammar, etc.) are quite fast and pretty easy to code. (There are technical papers showing how to compile LALR parsers directly to machine code; such parsers are blazingly fast, but you need to implement a lot of custom machinery to build one.)
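For comparison, this is the shape of a hand-written recursive descent parser for a toy expression grammar; a real language needs one such function per nonterminal, which is where the "lot of code" comes from. A sketch only, not C/AL:

```
function parseExpression(src) {
  var pos = 0;

  function peek() { return src[pos]; }
  function eat(ch) {
    if (src[pos] !== ch) throw new Error("expected '" + ch + "' at " + pos);
    pos++;
  }

  // expr := term ('+' term)*
  function expr() {
    var value = term();
    while (peek() === "+") { eat("+"); value += term(); }
    return value;
  }

  // term := factor ('*' factor)*
  function term() {
    var value = factor();
    while (peek() === "*") { eat("*"); value *= factor(); }
    return value;
  }

  // factor := NUMBER | '(' expr ')'
  function factor() {
    if (peek() === "(") { eat("("); var v = expr(); eat(")"); return v; }
    var start = pos;
    while (/[0-9]/.test(src[pos])) pos++;
    if (pos === start) throw new Error("expected a number at " + pos);
    return Number(src.slice(start, pos));
  }

  var result = expr();
  if (pos !== src.length) throw new Error("trailing input at " + pos);
  return result;
}

console.log(parseExpression("2+3*4")); // 14
```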
If you want flat-out high-performance parsing with minimal sweat, you should consider LRStar. (I know and highly respect the author, but otherwise have nothing to do with it.) This produces very fast LALR parsers. Downside: you have to make your grammar LALR-compatible. You will have to extend your "rules" the same way you extend any other C program: by writing C code. That doesn't seem much worse than writing JavaScript IMHO, but the rules will likely execute a lot faster, and at the scale you are contemplating this will matter.
GLR parsing is necessarily slower than LALR because it has more bookkeeping to do. But that's just a constant factor. It can be significant (e.g., 100x) compared to a high-performance engine like LRStar. It may be worth the trouble, because it is much easier to pound your grammar into shape, and a less convoluted grammar will likely make custom rules easier to write. If you really have millions of lines of code, these parsers will likely be medium speed at best.
PEG is basically backtracking: it has to try alternatives, and backtrack when they don't match. PEG parsers have to be slower than LALR parsers by at least the amount of backtracking they do. You also have the grammar-shaping problem.
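Here is a sketch of the ordered choice at the heart of PEG, showing where that backtracking comes from: each alternative is retried from the saved input position, and the work discarded on every failed attempt is the overhead described above. All names are illustrative:

```
// Try each alternative in order from the same position; the first
// success wins, and each failure rewinds to `pos` (the backtrack).
function choice(alternatives) {
  return function (src, pos) {
    for (var i = 0; i < alternatives.length; i++) {
      var result = alternatives[i](src, pos);
      if (result !== null) return result;
      // fall through: pos is unchanged, so we re-scan from the start
    }
    return null;
  };
}

// literal("let") succeeds only if the input at `pos` starts with "let".
function literal(text) {
  return function (src, pos) {
    return src.startsWith(text, pos)
      ? { value: text, next: pos + text.length }
      : null;
  };
}

// keyword <- "let" / "local" -- matching "local" re-examines input
// that the failed "let" attempt already looked at.
var keyword = choice([literal("let"), literal("local")]);
console.log(keyword("local x", 0)); // { value: 'local', next: 5 }
```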
What you will discover, though, is that parsing simply isn't enough if you want to do anything the faintest bit sophisticated. In that case you don't want to optimize for parsing; you want to optimize the infrastructure for program analysis. See my essay on Life After Parsing for another alternative.