I'm trying to write a C parser, for my own education. I know that I could use tools like YACC to simplify the process, but I want to learn as much as possible from the experience, so I'm starting from scratch.
My question is how I should handle a line like this:
doSomethingWith((foo)(bar));
It could be that (foo)(bar) is a type cast, as in:
typedef int foo;
void doSomethingWith(foo aFoo) { ... }
int main() {
    float bar = 23.6;
    doSomethingWith((foo)(bar));
    return 0;
}
Or, it could be that (foo)(bar) is a function call, as in:
int foo(int bar) { return bar; }
void doSomethingWith(int anInt) { ... }
int main() {
    int bar = 10;
    doSomethingWith((foo)(bar));
    return 0;
}
It seems to me that the parser cannot determine which of the two cases it is dealing with solely by looking at the line doSomethingWith((foo)(bar));
This annoys me, because I was hoping to be able to separate the parsing stage from the "interpretation" stage where you actually determine that the line typedef int foo; means that foo is now a valid type. In my imagined scenario, Type a = b + c * d would parse just fine, even if Type, a, b, c, and d aren't defined anywhere, and problems would only arise later, when actually trying to "resolve" the identifiers.
So, my question is: how do "real" C parsers deal with this? Is the separation between the two stages that I was hoping for just a naive wish, or am I missing something?
Historically, typedefs were a relatively late addition to C. Before they were added to the language, type names consisted of keywords (int, char, double, struct, etc.) and punctuation characters (*, [], ()), and so were easy to recognize unambiguously. An identifier could never be a type name, so an identifier in parentheses followed by an expression could not be a cast expression.
Typedefs made it possible for a user-defined identifier to be a type name, which rather seriously messed up the grammar.
Take a look at the syntax of type-specifier in the C standard (I'll use the C90 version since it's slightly simpler):
type-specifier:
    void
    char
    short
    int
    long
    float
    double
    signed
    unsigned
    struct-or-union-specifier
    enum-specifier
    typedef-name
All but the last can be easily recognized because they either are keywords, or start with a keyword. But a typedef-name is just an identifier.
When a C compiler processes a typedef declaration, it needs to, in effect, introduce the typedef name as a new keyword. This means that, unlike a parser for a language with a context-free grammar, a C parser needs feedback from the symbol table.
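As a rough illustration of that feedback loop (often called "the lexer hack"), here is a minimal sketch. It assumes a flat table of typedef names with no scoping at all, and every name in it (declare_typedef, classify, paren_name_starts_cast, the token kinds) is hypothetical rather than taken from any real compiler:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token kinds; a real lexer would have many more. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPEDEF_NAME };

#define MAX_TYPEDEFS 64
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* The parser would call this once it has finished a typedef declaration. */
void declare_typedef(const char *name)
{
    assert(typedef_count < MAX_TYPEDEFS);
    typedef_names[typedef_count++] = name;
}

/* The lexer would call this for every identifier it scans. */
enum token_kind classify(const char *name)
{
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPEDEF_NAME;
    return TOK_IDENTIFIER;
}

/* For input of the form "(name)(expr)": nonzero means parse it as a cast,
   zero means parse it as a call through a parenthesized function name. */
int paren_name_starts_cast(const char *name)
{
    return classify(name) == TOK_TYPEDEF_NAME;
}
```

With this in place, (foo)(bar) is treated as a function call until declare_typedef("foo") has run, and as a cast afterwards. The parser and the table can no longer be developed as fully separate stages, which is exactly the coupling the question was hoping to avoid.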
And even that's a bit of an oversimplification. A typedef name can still be redefined, either as another typedef or as something else, in an inner scope:
{
    typedef int foo; /* foo is a typedef name */
    {
        int foo; /* foo is now an ordinary identifier, an object name */
    }
    /* And now foo is a typedef name again */
}
So a typedef name is effectively a user-defined keyword if it's used in a context where a type name is valid, but is still an ordinary identifier if it's redeclared.
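One way to track this is to make the table scope-aware, so that the innermost declaration of a name decides how it is classified. The following is only a sketch under that assumption; the names (enter_scope, leave_scope, declare, lookup, name_kind) are made up for illustration, and a real compiler would keep much more information per binding:

```c
#include <assert.h>
#include <string.h>

enum name_kind { NAME_UNKNOWN, NAME_TYPEDEF, NAME_ORDINARY };

struct binding {
    const char *name;
    enum name_kind kind;
    int depth;              /* scope depth at which the name was declared */
};

#define MAX_BINDINGS 128
static struct binding bindings[MAX_BINDINGS];
static int binding_count = 0;
static int scope_depth = 0;

/* Call on '{'. */
void enter_scope(void) { scope_depth++; }

/* Call on '}': drop every binding made in the scope being closed. */
void leave_scope(void)
{
    assert(scope_depth > 0);
    while (binding_count > 0 &&
           bindings[binding_count - 1].depth == scope_depth)
        binding_count--;
    scope_depth--;
}

/* Record a declaration in the current scope. */
void declare(const char *name, enum name_kind kind)
{
    assert(binding_count < MAX_BINDINGS);
    bindings[binding_count++] = (struct binding){ name, kind, scope_depth };
}

/* The innermost declaration wins, so search from the newest binding back. */
enum name_kind lookup(const char *name)
{
    for (int i = binding_count - 1; i >= 0; i--)
        if (strcmp(bindings[i].name, name) == 0)
            return bindings[i].kind;
    return NAME_UNKNOWN;
}
```

Replaying the nested-block example: after the outer typedef, lookup("foo") yields NAME_TYPEDEF; after the inner int foo; it yields NAME_ORDINARY; and once the inner block is closed with leave_scope() it yields NAME_TYPEDEF again.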
TL;DR: Parsing C is hard.