In the early 1970s, Dennis Ritchie wrote the very first C compiler.
In 2017, I wanted to write a C compiler. Books like Expert C Programming: Deep C Secrets (Peter van der Linden) say that C was, above all else, designed to be easy to compile. But I've been having an inordinate amount of trouble with it.
For starters, it's already relatively difficult to come up with Lex/Yacc specifications for the C language, and these tools didn't even exist yet when Ritchie made his compiler!
Plus, there are a great many examples of surprisingly small C compilers that do not use any help from Lex & Yacc. (Check out this tiny obfuscated C compiler from Fabrice Bellard. Note that his "production" tinycc source is actually quite a bit longer, most likely in an effort to accommodate more architectures, and to be more readable)
So what am I missing here? What kind of lexer/parser did Ritchie use in his compiler? Is there some easier way of writing compilers that I just haven't stumbled onto?
The first C compiler, written by Dennis Ritchie, used a recursive descent parser, incorporated specific knowledge about the PDP-11, and relied on an optional machine-specific optimizer to improve the assembly language code it generated.
A lexer and a parser work in sequence: the lexer scans the input and produces tokens; the parser then consumes those tokens and produces the parse result.
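To make that division of labour concrete, here is a minimal hand-written lexer for a toy expression language (all of the names are invented for this sketch); a parser would simply call next_token() in a loop:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Token kinds for a toy expression language (names invented for this sketch). */
enum tok_kind { TOK_NUM, TOK_PLUS, TOK_STAR, TOK_LPAREN, TOK_RPAREN, TOK_EOF };

struct token {
    enum tok_kind kind;
    long value;              /* meaningful only for TOK_NUM */
};

static const char *src;      /* cursor into the source text */

/* The lexer: skip whitespace and group characters into the next token. */
static struct token next_token(void)
{
    struct token t = { TOK_EOF, 0 };
    while (isspace((unsigned char)*src))
        src++;
    if (isdigit((unsigned char)*src)) {
        char *end;
        t.kind = TOK_NUM;
        t.value = strtol(src, &end, 10);
        src = end;
    } else if (*src == '+') { t.kind = TOK_PLUS;   src++; }
    else if (*src == '*')   { t.kind = TOK_STAR;   src++; }
    else if (*src == '(')   { t.kind = TOK_LPAREN; src++; }
    else if (*src == ')')   { t.kind = TOK_RPAREN; src++; }
    return t;
}

int main(void)
{
    src = "1 + 2 * (3 + 4)";
    /* A parser would sit in a loop like this, consuming one token at a time. */
    for (struct token t = next_token(); t.kind != TOK_EOF; t = next_token())
        printf("kind=%d value=%ld\n", (int)t.kind, t.value);
    return 0;
}
```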
Parsers fall into two broad categories: top-down parsers and bottom-up parsers.
Yacc's name is an abbreviation for "yet another compiler compiler", which strongly suggests that it was neither the first nor the second such tool.
Indeed, the Wikipedia article on History of Compiler Construction notes that
In the early 1960s, Robert McClure at Texas Instruments invented a compiler-compiler called TMG, the name taken from "transmogrification". In the following years TMG was ported to several UNIVAC and IBM mainframe computers.
…
Not long after Ken Thompson wrote the first version of Unix for the PDP-7 in 1969, Doug McIlroy created the new system's first higher-level language: an implementation of McClure's TMG. TMG was also the compiler definition tool used by Ken Thompson to write the compiler for the B language on his PDP-7 in 1970. B was the immediate ancestor of C.
That's not quite an answer to your question, but it provides some possibilities.
I wouldn't be at all surprised if Ritchie just banged together a hand-built top-down or operator precedence parser. The techniques were well-known, and the original C language presented few challenges. But parser generating tools definitely existed.
A comment on the OP by Alexey Frunze points to this early version of the C compiler. It's basically a recursive-descent top-down parser, up to the point where expressions need to be parsed, at which point it switches to a shunting-yard-like operator-precedence technique. (See the function tree() in the first source file for the expression parser.) This style of starting with a top-down algorithm and switching to a bottom-up algorithm (such as operator precedence) when needed is sometimes called "left corner" (LC) parsing.
So that's basically the architecture which I said wouldn't surprise me, and it didn't :).
It's worth noting that the compiler unearthed by Alexey (and also by @Torek in a comment to this post) does not handle anything close to what we generally consider the C language these days. In particular, it handles only a small subset of the declaration syntax (no structs or unions, for example), which is probably the most complicated part of the K&R C grammar. So it does not answer your question about how to produce a "simple" parser for C.
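Still, the operator-precedence half of that architecture is worth seeing in miniature. The sketch below is my own illustration of precedence climbing over a handful of operators, not code from that early compiler; it evaluates as it goes rather than building a tree, and it omits error handling:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;            /* input cursor */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static long parse_expr(int min_prec);

/* "Primary" expressions are handled top-down, recursive-descent style. */
static long parse_primary(void)
{
    skip_ws();
    if (*p == '(') {
        p++;
        long v = parse_expr(0);
        skip_ws();
        if (*p == ')') p++;       /* a real parser would diagnose errors here */
        return v;
    }
    char *end;
    long v = strtol(p, &end, 10);
    p = end;
    return v;
}

static int precedence(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0; /* not a binary operator */
    }
}

/* Binary operators are handled bottom-up by operator precedence. */
static long parse_expr(int min_prec)
{
    long lhs = parse_primary();
    for (;;) {
        skip_ws();
        int prec = precedence(*p);
        if (prec == 0 || prec < min_prec)
            return lhs;
        char op = *p++;
        long rhs = parse_expr(prec + 1);   /* +1 makes the operators left-associative */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}

int main(void)
{
    p = "1 + 2 * (3 + 4) - 5";
    printf("%ld\n", parse_expr(0));   /* prints 10 */
    return 0;
}
```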
C is (mostly) parseable with an LALR(1) grammar, although you need to implement some version of the "lexer hack" in order to correctly parse cast expressions. The input to the parser (translation phase 7) will be a stream of tokens produced by the preprocessing code (translation phase 4, probably incorporating phases 5 and 6), which itself may draw upon a (f)lex tokenizer (phase 3) whose input will have been sanitized in some fashion according to phases 1 and 2. (See § 5.1.1.2 for a precise definition of the phases).
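As a rough sketch of the lexer hack (the token values and the symbol-table stand-in below are invented for illustration), the lexer's identifier rule consults the symbol table that the parser maintains and hands back a distinct token kind for typedef names:

```c
#include <string.h>

/* Token kinds shared with the parser; in a real build these would come
   from the bison-generated header, so both names and values are invented. */
enum { IDENTIFIER = 258, TYPEDEF_NAME = 259 };

/* Stand-in for the symbol table maintained by the parser's declaration
   actions; a real compiler would track scopes, not a fixed list. */
static int symtab_is_typedef(const char *name)
{
    static const char *typedef_names[] = { "size_t", "FILE", "T" };
    for (size_t i = 0; i < sizeof typedef_names / sizeof *typedef_names; i++)
        if (strcmp(name, typedef_names[i]) == 0)
            return 1;
    return 0;
}

/* The "lexer hack": the identifier rule asks the symbol table which token
   kind to emit, which is what lets the parser tell a cast like (T)*x
   apart from the multiplication (a)*x. */
int classify_identifier(const char *name)
{
    return symtab_is_typedef(name) ? TYPEDEF_NAME : IDENTIFIER;
}
```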
Sadly, (f)lex was not designed to be part of a pipeline; it really wants to handle the whole job of reading the source itself. However, flex can be convinced to let you provide chunks of input by redefining the YY_INPUT macro. Handling trigraphs (if you choose to do that) and line continuations can be done with a simple state machine; conveniently, these transformations only shrink the input, which simplifies handling of the maximum-length parameter to YY_INPUT. (Don't provide input one character at a time as suggested by the example in the flex manual.)
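For example, the definitions section of the flex file might redefine the macro along these lines; pp_fill_buffer is a hypothetical routine standing in for whatever code performs phases 1 and 2:

```c
/* Goes inside the %{ ... %} block of the flex definitions section.
   pp_fill_buffer is a hypothetical function that copies up to max_size
   bytes of phase-1/2-processed source into buf and returns the count;
   returning 0 signals end of input to flex. */
size_t pp_fill_buffer(char *buf, size_t max_size);

#define YY_INPUT(buf, result, max_size) \
    do { (result) = pp_fill_buffer((buf), (max_size)); } while (0)
```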
Since the preprocessor must produce a stream of tokens (at this point, whitespace is no longer important), it is convenient to use bison's push-parser interface. (Indeed, it is very often more convenient to use the push API.) If you take that suggestion, you will end up with phase 4 as the top-level driver of the parse.
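Under that arrangement, the phase-4 driver might push tokens roughly as follows; this is only a sketch, assuming %define api.push-pull push in the grammar, a bison-generated header whose name is illustrative, and a hypothetical pp_next_token that yields the next preprocessed token:

```c
#include "parser.tab.h"   /* bison-generated header; the file name is illustrative */

/* Hypothetical phase-4 routine: returns the next preprocessed token,
   filling *lval, or 0 at end of input. */
int pp_next_token(YYSTYPE *lval);

int parse_translation_unit(void)
{
    /* With %define api.push-pull push, bison generates this push API. */
    yypstate *ps = yypstate_new();
    YYSTYPE lval;
    int status;
    do {
        int tok = pp_next_token(&lval);
        status = yypush_parse(ps, tok, &lval);
    } while (status == YYPUSH_MORE);
    yypstate_delete(ps);
    return status;        /* 0 on success, nonzero on parse error */
}
```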
You could hand-build a preprocessor-directive parser, but getting #if expressions and pragmas right suggests the use of a separate bison parser for preprocessing.
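Such a preprocessor-expression grammar can stay very small. The abridged sketch below (token names invented, semantic actions and most operators omitted) only illustrates its shape:

```
/* pp_expr.y -- abridged sketch of a grammar for #if constant expressions;
   most C operators and all actions are omitted to keep the example short. */
%define api.prefix {ppexpr}   /* keep its symbols apart from the main C parser (bison 3.x) */
%token PP_NUMBER
%left OROR                    /* || */
%left ANDAND                  /* && */
%left '+' '-'
%left '*' '/' '%'
%right '!' UMINUS

%%
expr: expr OROR expr
    | expr ANDAND expr
    | expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | expr '%' expr
    | '!' expr
    | '-' expr %prec UMINUS
    | '(' expr ')'
    | PP_NUMBER
    ;
%%
```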
If you just want to learn how to build a compiler, you might want to start with a simpler language such as Tiger, the language used as a running example in Andrew Appel's excellent textbooks on compiler construction.