Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing an extremely simple parser

I'm writing a very basic web server that has to support an extremely limited special server side scripting language. Basically all I need to support is "echo", addition/subtraction/multiplication (no division) with only 2 operands, a simple "date()" function that outputs the date and the use of the "&" operator to concatenate strings.

An example could be:

echo "Here is the date: " & date();
echo "9 x 15 = : & 9*15;

I've gone through and created the code necessary to generate tokens, but I'm not sure I'm using the right tokens.

I created tokens for the following:

ECHO - The echo command
WHITESPACE - Any whitespace
STRING - A string inside quotations
DATE - The date() function
CONCAT - the & operator for concatenation
MATH - Any instance of binary operation (5+4, 9*2, 8-2, etc)
TERM - The terminal character (;)

The MATH one I am particularly unsure about. Typically I see people create a token specifically for integers and then for each operator as well, but since I ONLY want to allow binary operations, I thought it made sense to group it into one token. If I were to do everything separately, I would have to do some extra work to ensure that I never accepted "5+4+1".

So question 1 is am I on the right track with which tokens to use?

My next question is what to I do with these tokens next to ensure correct syntax? The approach that I had thought of was to basically say, "Okay I know I have this token, here is a list of tokens that are allowed to come next based on the current token. Is the next token in the list?"

Based on that, I made a list of all of my tokens as well as what tokens are valid to appear directly after them (didn't include whitespace for simplicity).

ECHO        ->      STRING|MATH|DATE
STRING      ->      TERM|CONCAT
MATH        ->      TERM|CONCAT
DATE        ->      TERM|CONCAT
CONCAT      ->      STRING|MATH|DATE

The problem is I'm not sure at all how to best implement this. Really I need to keep track of whitespace as well to make sure there are spaces between the tokens. But that means I have to look ahead two tokens at a time which is getting even more intimidating. I also am not sure how to manage the "valid next tokens" stuff without just some disgusting section of if blocks. Should I be checking for valid syntax before trying to actually execute the script, or should I do it all at once and just throw an error when I reach an unexpected token? In this simple example, everything will always work just fine parsing left to right, there's no real precedence rules (except the MATH thing, but that's part of why I combined it into one token even though it feels wrong.) Even so, I wouldn't mind designing a more scalable and elegant solution.

In my research about writing parsers, I see a lot of references to creating "accept()" and "expect()" functions but I can't find any clear description of what they are supposed to do or how they are supposed to work.

I guess I'm just not sure how to implement this, and then how to actually come up with a resulting string at the end of the day.

Am I heading in the right direction and does anybody know of a resource that might help me understand how to best implement something simple like this? I am required to do it by hand and cannot use a tool like ANTLR.

Thanks in advance for any help.

like image 824
ARW Avatar asked Nov 09 '12 15:11

ARW


1 Answers

The first thing that you need to do is to discard all the white-spaces (except for the ones in strings). This way, when you add tokens to the list of tokens, you are sure that the list contains only valid tokens. For example, consider this statement:

echo "Here is the date: " & date();

I will start tokenizing and first separate echo based on the white-space (yes, white-space is needed here to separate it but isn't useful after that). The tokenizer then encounters a double quote and continues reading everything until the closing double quote is found. Similarly, I create separate tokens for &, date and ().

My token list now contains the following tokens:

echo
"Here is the date: "
&
date
()

Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks if it is valid (based on the rules / functions you have for the language). It advances to the next token and sees if it is either of the date, string or math. Similarly, it checks the rest of the tokens. If at any point, a token is not supposed to be there, you can throw an error indicating syntax error or something.

For the math statement tokenization, only combine the expression that is contained in a bracket and rest of operands and operators separately. For example: 9/3 + (7-3+1) would have the tokens 9, /, 3, +, and (7-3+1). As every token has its own priority (that you define in the token struct), you can start evaluating from the highest priority token down to the lowest token priority. This way you can have prioritized expressions. If you still have confusion, let me know. I'll write you some example code.

like image 144
DelegateX Avatar answered Sep 27 '22 18:09

DelegateX