For all you compiler gurus, I wanna write a recursive descent parser and I wanna do it with just code. No generating lexers and parsers from some other grammar and don't tell me to read the dragon book, i'll come around to that eventually.
I wanna get into the gritty details about implementing a lexer and parser for a reasonable simple language, say CSS. And I wanna do this right.
This will probably end up being a series of questions but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.
I find my self writing code like this (hopefully you can infer the rest from this snippet):
public CssToken ReadNext() { int val; while ((val = _reader.Read()) != -1) { var c = (char)val; switch (_stack.Top) { case ParserState.Init: if (c == ' ') { continue; // ignore } else if (c == '.') { _stack.Transition(ParserState.SubIdent, ParserState.Init); } break; case ParserState.SubIdent: if (c == '-') { _token.Append(c); } _stack.Transition(ParserState.SubNMBegin); break;
What is this called? and how far off am I from something reasonable well understood? I'm trying to balance something which is fair in terms of efficiency and easy to work with, using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.
What I have is an input stream, from which I can read 1 character at a time. I don't do any look a head right now, I just read the character then depending on the current state try to do something with that.
I'd really like to get into the mind set of writing reusable snippets of code. This Transition
method is currently means to do that, it will pop the current state of the stack and then push the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init)
it will "call" a sub routine SubIdent
which will, when complete, return to the Init
state.
The parser will be implemented in much the same way, currently, having everything in a single big method like this allows me to easily return a token when I found one, but it also forces me to keep everything in one single big method. Is there a nice way to split these tokenization rules into separate methods?
A handwritten parser: Writing a parser by hand is a moderately difficult task. Complexity may increase if the language-grammar is complex.
What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer, it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automatons.
Lexers are almost always written as finite state machines, i.e., like your code except get rid of the "stack" object. Finite state machines are closely related to regular expressions (actually, they're provably equivalent to one another). When designing such a parser, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
There are tools to do this. The lex
tool and its descendants are well known and have been translated into many languages. The ANTLR
toolchain also has a lexer component. My preferred tool is ragel
on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.
If you do want to write your own lexer by hand, good ones often look something like this:
function readToken() // note: returns only one token each time while !eof c = peekChar() if c in A-Za-z return readIdentifier() else if c in 0-9 return readInteger() else if c in ' \n\r\t\v\f' nextChar() ... return EOF function readIdentifier() ident = "" while !eof c = nextChar() if c in A-Za-z0-9 ident.append(c) else return Token(Identifier, ident) // or maybe... return Identifier(ident)
Then you can write your parser as a recursive descent parser. Don't try to combine lexer / parser stages into one, it leads to a total mess of code. (According to the Parsec author, it's slower, too).
You need to write your own Recursive Descent Parser from your BNF/EBNF. I had to write my own recently and this page was a lot of help. I'm not sure what you mean by "with just code". Do you mean you want to know how to write your own recursive parser?
If you want to do that, you need to have your grammar in place first. Once you have your EBNF/BNF in place, the parser can be written quite easily from it.
The first thing I did when I wrote my parser, was to read everything in and then tokenize the text. So I essentially ended up with an array of tokens that I treated as a stack. To reduce the verbosity/overhead of pulling a value off a stack and then pushing it back on if you don't require it, you can have a peek
method that simply returns the top value on the stack without popping it.
UPDATE
Based on your comment, I had to write a recursive-descent parser in Javascript from scratch. You can take a look at the parser here. Just search for the constraints
function. I wrote my own tokenize
function to tokenize the input as well. I also wrote another convenience function (peek
, that I mentioned before). The parser parses according to the EBNF here.
This took me a little while to figure out because it's been years since I wrote a parser (last time I wrote it was in school!), but trust me, once you get it, you get it. I hope my example gets your further along on your way.
ANOTHER UPDATE
I also realized that my example may not be what you want because you might be going towards using a shift-reduce parser. You mentioned that right now you are trying to write a tokenizer. In my case, I did write my own tokenizer in Javascript. It's probably not robust, but it was sufficient for my needs.
function tokenize(options) { var str = options.str; var delimiters = options.delimiters.split(""); var returnDelimiters = options.returnDelimiters || false; var returnEmptyTokens = options.returnEmptyTokens || false; var tokens = new Array(); var lastTokenIndex = 0; for(var i = 0; i < str.length; i++) { if(exists(delimiters, str[i])) { var token = str.substring(lastTokenIndex, i); if(token.length == 0) { if(returnEmptyTokens) { tokens.push(token); } } else { tokens.push(token); } if(returnDelimiters) { tokens.push(str[i]); } lastTokenIndex = i + 1; } } if(lastTokenIndex < str.length) { var token = str.substring(lastTokenIndex, str.length); token = token.replace(/^\s+/, "").replace(/\s+$/, ""); if(token.length == 0) { if(returnEmptyTokens) { tokens.push(token); } } else { tokens.push(token); } } return tokens; }
Based on your code, it looks like you are reading, tokenizing, and parsing at the same time - I'm assuming that's what a shift-reduce parser does? The flow for what I have is tokenize first to build the stack of tokens, and then send the tokens through the recursive-descent parser.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With