Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way to write a syntax tokenizer/parser in C? [closed]

Background Information:
I have a desire to make a programming language, knowing the tools to do so, I don't have any good examples on how to use them. I really do not want to use Flex or Bison as it doesn't teach the abstractness that I feel is needed for creating a compiler. I have the concepts of creating strings, tokenizing them, feeding them to a file that acts as grammar and parses eventually creating an actual program to run the language. The problem is, I don't know how to write a tokenizer or parser. I have general ideas, but I understand better when I can see examples. If someone could post an/few example(s), that would be great!

My question is as follows:
Can someone post examples of how to do write a syntax tokenizer/parser in C?

like image 295
Sora Avatar asked Dec 24 '22 23:12

Sora


1 Answers

If you want to write a very complex syntax parser in C, without using any existing pattern matching code, it's usually best to implement a state machine and then process source code char by char.

The output of Flex+Bison is also just a state machine. Flex uses regular expressions to tokenize a string into tokens that are then passed to a Bison state machine, processing one token after another, depending on the current state of the machine. But you don't need a regex tokenizer, you can tokenize the input as part of the state machine processing. A regex matcher itself can also be implemented as a state machine, so token generation can be directly part of your state machine.

Here's an interesting link for you; it's not C in particular, more a general overview how state machines work, but once you got the concept, it's easy to transpose that to C code:

Parsing command line arguments using a finite state machine and backtracking

Here's some sample code of a super primitive CSV parser:

#include <stdlib.h>
#include <stdio.h>

static char currentToken[4096];
static size_t currentTokenLength;

static
void addCharToCurrentToken ( char c ) {
    if (currentTokenLength < sizeof(currentToken)) {
        currentToken[currentTokenLength++] = c;
    }
}

static
void printCurrentToken ( ) {
    printf("Token: >>>%.*s<<<\n", (int)currentTokenLength, currentToken);
    currentTokenLength = 0;
}


typedef enum {
    STATE_FindStartOfData,
    STATE_FindStartOfToken,
    STATE_ParseNumber,
    STATE_ParseString,
    STATE_CheckEndOfString,
    STATE_FindDelimiter,
    STATE_ParseError,
    STATE_EndOfData
} ParserState;


ParserState parserState = STATE_FindStartOfData;


static
void runTheStateMachine ( ) {
    while (parserState != STATE_ParseError
            && parserState != STATE_EndOfData
    ) {
        int c = fgetc(stdin);
        // End of data?
        if (c == -1) {
            switch (parserState) {
                case STATE_ParseNumber:
                case STATE_CheckEndOfString:
                    printCurrentToken();
                    parserState = STATE_EndOfData;
                    break;

                case STATE_ParseString:
                    // Data ends in the middle of token parsing? No way!
                    fprintf(stderr, "Data ended abruptly!\n");
                    parserState = STATE_ParseError;
                    break;

                case STATE_FindStartOfData:
                case STATE_FindStartOfToken:
                case STATE_FindDelimiter:
                    // This is okay, data stream may end while in these states
                    parserState = STATE_EndOfData;
                    break;

                case STATE_ParseError:
                case STATE_EndOfData:
                    break;
            }
        }

        switch (parserState) {
                case STATE_FindStartOfData:
                    // Skip blank lines
                    if (c == '\n' || c == '\r') break;
                    // !!!FALLTHROUGH!!!

                case STATE_FindStartOfToken:
                    // Skip overe all whitespace
                    if (c == ' ' || c == '\t') break;
                    // Start of string?
                    if (c == '"') {
                        parserState = STATE_ParseString;
                        break;
                    }
                    // Blank field?
                    if (c == ',') {
                        printCurrentToken();
                        break;
                    }
                    // End of dataset?
                    if (c == '\n' || c == '\r') {
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Everything else can only be a number
                    parserState = STATE_ParseNumber;
                    addCharToCurrentToken(c);
                    break;

                case STATE_ParseNumber:
                    if (c == ' ' || c == '\t') {
                        // Numbers cannot contain spaces in the middle,
                        // so this must be the end of the number.
                        printCurrentToken();
                        // We still need to find the real delimiter, though.
                        parserState = STATE_FindDelimiter;
                        break;
                    }
                    if (c == ',') {
                        // This time the number ends directly with a delimiter
                        printCurrentToken();
                        parserState = STATE_FindStartOfToken;
                        break;
                    }
                    // End of dataset?
                    if (c == '\n' || c == '\r') {
                        printCurrentToken();
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Otherwise keep reading the number
                    addCharToCurrentToken(c);
                    break;

                case STATE_ParseString:
                    if (c == '"') {
                        // Either this is the regular end of the string or it is just an
                        // escaped quotation mark which is doubled ("") in CVS.
                        parserState = STATE_CheckEndOfString;
                        break;
                    }
                    // All other chars are just treated as ordinary chars
                    addCharToCurrentToken(c);
                    break;

                case STATE_CheckEndOfString:
                    if (c == '"') {
                        // Next char is also a quotation mark,
                        // so this was not the end of the string.
                        addCharToCurrentToken(c);
                        parserState = STATE_ParseString;
                        break;
                    }
                    if (c == ' ' || c == '\t') {
                        // It was the end of the string
                        printCurrentToken();
                        // We still need to find the real delimiter, though.
                        parserState = STATE_FindDelimiter;
                        break;
                    }
                    if (c == ',') {
                        // It was the end of the string
                        printCurrentToken();
                        // And we even found the delimiter
                        parserState = STATE_FindStartOfToken;
                        break;
                    }
                    if (c == '\n' || c == '\r') {
                        // It was the end of the string
                        printCurrentToken();
                        // And we even found the end of this dataset
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Everything else is a parse error I guess
                    fprintf(stderr, "Unexpected char 0x%02X after end of string!\n", c);
                    parserState = STATE_ParseError;
                    break;

                case STATE_FindDelimiter:
                    // Delemiter found?
                    if (c == ',') {
                        parserState = STATE_FindStartOfToken;
                        break;
                    }
                    // Just skip overe all whitespace
                    if (c == ' ' || c == '\t') break;
                    // End of dataset?
                    if (c == '\n' || c == '\r') {
                        // And we even found the end of this dataset
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Anything else a pare error I guess
                    fprintf(stderr, "Unexpected char 0x%02X after end of token!\n", c);
                    parserState = STATE_ParseError;
                    break;

                case STATE_ParseError:
                    // Nothing to do
                    break;

                case STATE_EndOfData:
                    // Nothing to do
                    break;
        }
    }
}

int main ( ) {
    runTheStateMachine();
    return (parserState == STATE_EndOfData ? 0 : 1);
}

The code makes the following assumptions:

  • Tokens are never bigger than 4096 chars.
  • The separator character is a comma
    (that's what CVS implies but not all CVS files use comma for that purpose)
  • Strings are always quoted
    (usually this is optional unless they contain spaces or quotation marks)
  • There are no line breaks inside quoted strings
    (this is usually allowed)
  • The code assumes that everything that is not quoted is a number but it won't verify that the format of the number is correct.

This code can definitely not parse any CSV data you feed it, but when you feed it that file:

"Year","Brand","Model"   ,"Description",  "Price"
    1997,"Ford", "E350","ac, abs, moon", 3000.00
1999,"Chevy","Venture ""Extended Edition""",,4900.00
 1999,"Chevy",     "Venture ""Extended Edition, Very Large"""  ,  , 5000.00
1996,"Jeep", "Grand Cherokee","MUST SELL!"

It will produce the following output:

Token: >>>Year<<<
Token: >>>Brand<<<
Token: >>>Model<<<
Token: >>>Description<<<
Token: >>>Price<<<
------------------------------------------
Token: >>>1997<<<
Token: >>>Ford<<<
Token: >>>E350<<<
Token: >>>ac, abs, moon<<<
Token: >>>3000.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition"<<<
Token: >>><<<
Token: >>>4900.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition, Very Large"<<<
Token: >>><<<
Token: >>>5000.00<<<
------------------------------------------
Token: >>>1996<<<
Token: >>>Jeep<<<
Token: >>>Grand Cherokee<<<
Token: >>>MUST SELL!<<<
------------------------------------------

And it's only supposed to give you an idea how you parse complex syntax with a state machine. This code is far from production quality and as you can see, such a switch quickly grows huge, so I'd at least put the state code into functions or even turn every state into something like a struct or an object for data encapsulation, otherwise this whole thing soon becomes unmanageable.

like image 87
Mecki Avatar answered Jan 18 '23 23:01

Mecki