Does anyone know what the weakest family of widely-used parsing algorithms is that can parse C code? That is, is the C grammar LL(1), LR(0), LALR(1), etc.? I'm curious because as a side project I'm interested in writing a parser generator for one of these families and would like to ultimately be able to parse C code for another side project.
C is (mostly) parseable with an LALR(1) grammar, although you need to implement some version of the "lexer hack" in order to correctly parse cast expressions.
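To make the "lexer hack" concrete, here is a minimal sketch in Python rather than a real C front end. The idea is that the lexer consults a table of names declared with `typedef`, so the parser receives a different token kind for type names than for ordinary identifiers. The token kinds `TYPE_NAME`, `IDENTIFIER`, and `PUNCT`, the `tokenize` function, and the tiny set of recognized punctuation are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern: identifiers plus a handful of punctuation marks.
TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<punct>[()*;=]))")


def tokenize(source, typedef_names):
    """Split source into tokens, classifying identifiers via typedef_names.

    typedef_names stands in for the symbol table that the parser would
    update whenever it processes a typedef declaration.
    """
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = match.end()
        name = match.group("ident")
        if name:
            # The hack: the same spelling yields a different token kind
            # depending on what has been declared so far.
            kind = "TYPE_NAME" if name in typedef_names else "IDENTIFIER"
            tokens.append((kind, name))
        else:
            tokens.append(("PUNCT", match.group("punct")))
    return tokens


# With T known as a typedef, "(T) * x" can be parsed as a cast of "*x".
as_cast = tokenize("(T) * x", {"T"})
# Without it, the same text is a multiplication of T by x.
as_product = tokenize("(T) * x", set())
```

The token stream is identical in both cases except for the kind attached to `T`, and that single difference is what lets an LALR(1) grammar choose between the cast and the multiplication without extra lookahead.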
The LR parsing algorithm is one of the most efficient parsing algorithms. It is fully deterministic: the parser makes one left-to-right pass over the input, with no backtracking or search involved.
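As a rough illustration of why no backtracking is needed, here is a sketch of a table-driven shift-reduce loop in Python. It uses a toy grammar (`S -> ( S ) | x`), not C, and the `ACTION`, `GOTO`, and `RULES` tables were built by hand for this example; a parser generator would normally compute them. All names here are assumptions for illustration.

```python
# Toy grammar (rule numbers are used in the reduce actions):
#   1: S -> ( S )
#   2: S -> x
# ACTION maps (state, lookahead) to exactly one move, which is why the
# loop below never has to guess or backtrack.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
# GOTO gives the state to enter after reducing to a nonterminal.
GOTO = {(0, "S"): 1, (2, "S"): 4}
# Each rule: (left-hand side, number of symbols on the right-hand side).
RULES = {1: ("S", 3), 2: ("S", 1)}


def parse(tokens):
    """Return True if tokens form a sentence of the toy grammar."""
    stack = [0]                      # stack of parser states
    tokens = list(tokens) + ["$"]    # "$" marks end of input
    pos = 0
    while True:
        action = ACTION.get((stack[-1], tokens[pos]))
        if action is None:
            return False             # no table entry: syntax error
        kind, arg = action
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]      # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True              # accept


balanced = parse("((x))")
unbalanced = parse("(x")
```

Every token is shifted exactly once and every step is a single table lookup, so the running time is linear in the length of the input.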
The Document Parsing algorithm breaks a document up into its largest constituents, typically sentences and clauses. The initial step is usually to convert the sentences of the source text into a stemmed form called the Sentence Graph. Document parsing also includes tokenization.
Bison generates LALR(1) parsers by default. LALR parsers accept a larger class of grammars than LL parsers, but they are also more complex to build. From this I suspect that LALR(1) is probably the weakest of the common parsing algorithms that can parse C code.
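Unless you're really set on rolling your own recognizer, ANTLR would probably be your best bet to do this. ANTLR uses an LL(*) algorithm, which is a top-down strategy with unbounded lookahead. It is not the same as LALR, but in practice it handles many of the grammars that LALR(1) tools are used for.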