Background Information:
I have a desire to make a programming language, knowing the tools to do so, I don't have any good examples on how to use them. I really do not want to use Flex or Bison as it doesn't teach the abstractness that I feel is needed for creating a compiler. I have the concepts of creating strings, tokenizing them, feeding them to a file that acts as grammar and parses eventually creating an actual program to run the language. The problem is, I don't know how to write a tokenizer or parser. I have general ideas, but I understand better when I can see examples. If someone could post an/few example(s), that would be great!
My question is as follows:
Can someone post examples of how to do write a syntax tokenizer/parser in C?
If you want to write a very complex syntax parser in C, without using any existing pattern matching code, it's usually best to implement a state machine and then process source code char by char.
The output of Flex+Bison is also just a state machine. Flex uses regular expressions to tokenize a string into tokens that are then passed to a Bison state machine, processing one token after another, depending on the current state of the machine. But you don't need a regex tokenizer, you can tokenize the input as part of the state machine processing. A regex matcher itself can also be implemented as a state machine, so token generation can be directly part of your state machine.
Here's an interesting link for you; it's not C in particular, more a general overview how state machines work, but once you got the concept, it's easy to transpose that to C code:
Parsing command line arguments using a finite state machine and backtracking
Here's some sample code of a super primitive CSV parser:
#include <stdlib.h>
#include <stdio.h>
static char currentToken[4096];
static size_t currentTokenLength;
static
void addCharToCurrentToken ( char c ) {
if (currentTokenLength < sizeof(currentToken)) {
currentToken[currentTokenLength++] = c;
}
}
static
void printCurrentToken ( ) {
printf("Token: >>>%.*s<<<\n", (int)currentTokenLength, currentToken);
currentTokenLength = 0;
}
typedef enum {
STATE_FindStartOfData,
STATE_FindStartOfToken,
STATE_ParseNumber,
STATE_ParseString,
STATE_CheckEndOfString,
STATE_FindDelimiter,
STATE_ParseError,
STATE_EndOfData
} ParserState;
ParserState parserState = STATE_FindStartOfData;
static
void runTheStateMachine ( ) {
while (parserState != STATE_ParseError
&& parserState != STATE_EndOfData
) {
int c = fgetc(stdin);
// End of data?
if (c == -1) {
switch (parserState) {
case STATE_ParseNumber:
case STATE_CheckEndOfString:
printCurrentToken();
parserState = STATE_EndOfData;
break;
case STATE_ParseString:
// Data ends in the middle of token parsing? No way!
fprintf(stderr, "Data ended abruptly!\n");
parserState = STATE_ParseError;
break;
case STATE_FindStartOfData:
case STATE_FindStartOfToken:
case STATE_FindDelimiter:
// This is okay, data stream may end while in these states
parserState = STATE_EndOfData;
break;
case STATE_ParseError:
case STATE_EndOfData:
break;
}
}
switch (parserState) {
case STATE_FindStartOfData:
// Skip blank lines
if (c == '\n' || c == '\r') break;
// !!!FALLTHROUGH!!!
case STATE_FindStartOfToken:
// Skip overe all whitespace
if (c == ' ' || c == '\t') break;
// Start of string?
if (c == '"') {
parserState = STATE_ParseString;
break;
}
// Blank field?
if (c == ',') {
printCurrentToken();
break;
}
// End of dataset?
if (c == '\n' || c == '\r') {
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Everything else can only be a number
parserState = STATE_ParseNumber;
addCharToCurrentToken(c);
break;
case STATE_ParseNumber:
if (c == ' ' || c == '\t') {
// Numbers cannot contain spaces in the middle,
// so this must be the end of the number.
printCurrentToken();
// We still need to find the real delimiter, though.
parserState = STATE_FindDelimiter;
break;
}
if (c == ',') {
// This time the number ends directly with a delimiter
printCurrentToken();
parserState = STATE_FindStartOfToken;
break;
}
// End of dataset?
if (c == '\n' || c == '\r') {
printCurrentToken();
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Otherwise keep reading the number
addCharToCurrentToken(c);
break;
case STATE_ParseString:
if (c == '"') {
// Either this is the regular end of the string or it is just an
// escaped quotation mark which is doubled ("") in CVS.
parserState = STATE_CheckEndOfString;
break;
}
// All other chars are just treated as ordinary chars
addCharToCurrentToken(c);
break;
case STATE_CheckEndOfString:
if (c == '"') {
// Next char is also a quotation mark,
// so this was not the end of the string.
addCharToCurrentToken(c);
parserState = STATE_ParseString;
break;
}
if (c == ' ' || c == '\t') {
// It was the end of the string
printCurrentToken();
// We still need to find the real delimiter, though.
parserState = STATE_FindDelimiter;
break;
}
if (c == ',') {
// It was the end of the string
printCurrentToken();
// And we even found the delimiter
parserState = STATE_FindStartOfToken;
break;
}
if (c == '\n' || c == '\r') {
// It was the end of the string
printCurrentToken();
// And we even found the end of this dataset
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Everything else is a parse error I guess
fprintf(stderr, "Unexpected char 0x%02X after end of string!\n", c);
parserState = STATE_ParseError;
break;
case STATE_FindDelimiter:
// Delemiter found?
if (c == ',') {
parserState = STATE_FindStartOfToken;
break;
}
// Just skip overe all whitespace
if (c == ' ' || c == '\t') break;
// End of dataset?
if (c == '\n' || c == '\r') {
// And we even found the end of this dataset
printf("------------------------------------------\n");
parserState = STATE_FindStartOfData;
break;
}
// Anything else a pare error I guess
fprintf(stderr, "Unexpected char 0x%02X after end of token!\n", c);
parserState = STATE_ParseError;
break;
case STATE_ParseError:
// Nothing to do
break;
case STATE_EndOfData:
// Nothing to do
break;
}
}
}
int main ( ) {
runTheStateMachine();
return (parserState == STATE_EndOfData ? 0 : 1);
}
The code makes the following assumptions:
This code can definitely not parse any CSV data you feed it, but when you feed it that file:
"Year","Brand","Model" ,"Description", "Price"
1997,"Ford", "E350","ac, abs, moon", 3000.00
1999,"Chevy","Venture ""Extended Edition""",,4900.00
1999,"Chevy", "Venture ""Extended Edition, Very Large""" , , 5000.00
1996,"Jeep", "Grand Cherokee","MUST SELL!"
It will produce the following output:
Token: >>>Year<<<
Token: >>>Brand<<<
Token: >>>Model<<<
Token: >>>Description<<<
Token: >>>Price<<<
------------------------------------------
Token: >>>1997<<<
Token: >>>Ford<<<
Token: >>>E350<<<
Token: >>>ac, abs, moon<<<
Token: >>>3000.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition"<<<
Token: >>><<<
Token: >>>4900.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition, Very Large"<<<
Token: >>><<<
Token: >>>5000.00<<<
------------------------------------------
Token: >>>1996<<<
Token: >>>Jeep<<<
Token: >>>Grand Cherokee<<<
Token: >>>MUST SELL!<<<
------------------------------------------
And it's only supposed to give you an idea how you parse complex syntax with a state machine. This code is far from production quality and as you can see, such a switch
quickly grows huge, so I'd at least put the state code into functions or even turn every state into something like a struct or an object for data encapsulation, otherwise this whole thing soon becomes unmanageable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With