What kind of lexer/parser was used in the very first C compiler? [closed]

Tags:

In the early 1970s, Dennis Ritchie wrote the very first C compiler.

In the year 2017, I wanted to write a C compiler. Books like Deep C Secrets (Peter Van Der Linden) say that C was, above all else, designed to be easy to compile. But I've been having an inordinate amount of trouble with it.

For starters, it's already relatively difficult to come up with Lex/Yacc specifications for the C language, and these tools didn't even exist yet when Ritchie made his compiler!

Plus, there are a great many examples of surprisingly small C compilers that do not use any help from Lex & Yacc. (Check out this tiny obfuscated C compiler from Fabrice Bellard. Note that his "production" tinycc source is actually quite a bit longer, most likely in an effort to accommodate more architectures, and to be more readable)

So what am I missing here? What kind of lexer/parser did Ritchie use in his compiler? Is there some easier way of writing compilers that I just haven't stumbled onto?

748

asked Sep 12 '17 00:09

Marco Merlini

1 Answers

Yacc's name is an abbreviation for "yet another compiler compiler", which strongly suggests that it was neither the first nor the second such tool.

Indeed, the Wikipedia article on History of Compiler Construction notes that

In the early 1960s, Robert McClure at Texas Instruments invented a compiler-compiler called TMG, the name taken from "transmogrification". In the following years TMG was ported to several UNIVAC and IBM mainframe computers.

…

Not long after Ken Thompson wrote the first version of Unix for the PDP-7 in 1969, Doug McIlroy created the new system's first higher-level language: an implementation of McClure's TMG. TMG was also the compiler definition tool used by Ken Thompson to write the compiler for the B language on his PDP-7 in 1970. B was the immediate ancestor of C.

That's not quite an answer to your question, but it provides some possibilities.

Original answer:

I wouldn't be at all surprised if Ritchie just banged together a hand-built top-down or operator precedence parser. The techniques were well-known, and the original C language presented few challenges. But parser generating tools definitely existed.

Postscript:

A comment on the OP by Alexey Frunze points to this early version of the C compiler. It's basically a recursive-descent top-down parser, up to the point where expressions need to be parsed at which point it uses a shunting-yard-like operator precedence grammar. (See the function tree in the first source file for the expression parser.) This style of starting with a top-down algorithm and switching to a bottom-up algorithm (such as operator-precedence) when needed is sometimes called "left corner" (LC) parsing.

So that's basically the architecture which I said wouldn't surprise me, and it didn't :).

It's worth noting that the compiler unearthed by Alexey (and also by @Torek in a comment to this post) does not handle anything close to what we generally consider the C language these days. In particular, it handles only a small subset of the declaration syntax (no structs or unions, for example), which is probably the most complicated part of the K&R C grammar. So it does not answer your question about how to produce a "simple" parser for C.

C is (mostly) parseable with an LALR(1) grammar, although you need to implement some version of the "lexer hack" in order to correctly parse cast expressions. The input to the parser (translation phase 7) will be a stream of tokens produced by the preprocessing code (translation phase 4, probably incorporating phases 5 and 6), which itself may draw upon a (f)lex tokenizer (phase 3) whose input will have been sanitized in some fashion according to phases 1 and 2. (See § 5.1.1.2 for a precise definition of the phases).

Sadly, (f)lex was not designed to be part of a pipeline; they really want to just handle the task of reading the source. However, flex can be convinced to let you provide chunks of input by redefining the YY_INPUT macro. Handling trigraphs (if you chose to do that) and line continuations can be done using a simple state machine; it's convenient that these transformations only shrink the input, simplifying handling of the maximum input length parameter to YY_INPUT. (Don't provide input one character at a time as suggested by the example in the flex manual.)

Since the preprocessor must produce a stream of tokens (at this point, whitespace is no longer important), it is convenient to use bison's push-parser interface. (Indeed, it is very often more convenient to use the push API.) If you take that suggestion, you will end up with phase 4 as the top-level driver of the parse.

You could hand-build a preprocessor-directive parser, but getting #if expressions and pragmas right suggests the use of a separate bison parser for preprocessing.

If you just want to learn how to build a compiler, you might want to start with a simpler language such as Tiger, the language used as a running example in Andrew Appel's excellent textbooks on compiler construction.

165

answered Oct 06 '22 01:10

rici

Related questions
                            
                                What happens if an invalid address is prefetched?
                            
                                LLDB - setting source code path
                            
                                Is rename required by standard to be atomic?
                            
                                Is there a portable/standard-compliant way to get filenames and linenumbers in a stack trace?
                            
                                Convert string to non string with C macro [duplicate]
                            
                                C String encoding Windows/Linux
                            
                                How to avoid Broadcast messages from syslog printed on the console
                            
                                Fatal error: start symbol does not derive any sentence
                            
                                Cython Pass String To C as Bytes
                            
                                Default stack size
                            
                                Why does FreeBSD's implementation of memchr increment its pointer in its condition?
                            
                                Not a constant initializer element?
                            
                                C library version conflicts, how to manage?
                            
                                gcc -g flag: Moving the Source Code
                            
                                nfhook (netfilter) error: assignment from incompatible pointer type
                            
                                Allocate file on disk without zeroing
                            
                                Bluetooth programming in C - Secure Connection and Data transfer
                            
                                Can I retrieve the source code of the library packaged with Conan package manager to be able to debug in it?
                            
                                fgets() call with redirection get abnormal data stream
                            
                                How to concatenate strings in the arguments of _Pragma

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What kind of lexer/parser was used in the very first C compiler? [closed]

Tags:

c

parsing

compiler-construction