I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult
kind of project :)). I have a tokenized string with newlines signifying the end of a, euh... line. See here for an input example. All the comments and junk are filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:

1. Should I use an enum or an abstract class + derivatives for my states? The first is probably better for a small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity (rough sketch below). enum example and class example. EDIT: what about this suggestion for goto? I thought they were evil in C++?
2. When reading a list, I need to NOT ignore \n. My preferred way of using the string, via stringstream, will ignore \n by default. So I need a simple way of telling (the same!) stringstream to not ignore newlines when a certain state is enabled.
3. Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?

Here's the draft states I have in mind:

upper: reads global, exe, lib + target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.

Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion everywhere (even in the list state, per item).
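To make question 1 a bit more concrete, something like this is the kind of enum-based skeleton I'm picturing (the state names match my draft above; the transition logic is only a placeholder, not a real implementation):

```cpp
#include <sstream>
#include <string>

// Rough sketch only: states mirror my draft (upper/normal/list),
// the transitions here are just placeholders.
enum class State { Upper, Normal, List };

void parse(const std::string& input)
{
    std::istringstream in(input);
    State state = State::Upper;
    std::string token;
    while (in >> token) {
        switch (state) {
        case State::Upper:
            if (token == "{") state = State::Normal;   // entered a scope
            break;
        case State::Normal:
            if (token == "}") state = State::Upper;    // left the scope
            else if (token == "SOURCES" || token == "HEADERS")
                state = State::List;                   // start collecting a list
            break;
        case State::List:
            // ...add token to the current list; a newline should end the list,
            // which is exactly the \n problem from question 2.
            break;
        }
    }
}
```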
Thanks for any input.
A parser is a specialized type of state machine that analyzes the components and meaning of structured texts. Generally a parser is accompanied by its own high-level description language that describes the states and transitions used by the implied state machine.
A finite state machine is a machine that can, at any point in time, be in a specific state from a finite set of possible states. It can move (transition) to another state by accepting an input. If the machine allows for outputs, it can produce an output.
A system where particular inputs cause particular changes in state can be represented as a finite state machine. The classic example is a turnstile: inserting a coin unlocks it, and after the turnstile has been pushed, it locks again.
If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
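To illustrate the recursive-function (recursive descent) approach: each call below consumes one balanced {...} block and recurses when it meets a nested one. This is only a sketch, and it assumes the input has already been split into a vector of tokens:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative recursive-descent sketch: 'tokens' is the already-tokenized
// input, 'pos' is the current read position.
void parse_scope(const std::vector<std::string>& tokens, std::size_t& pos)
{
    if (tokens.at(pos) != "{")
        throw std::runtime_error("expected '{'");
    ++pos;
    while (tokens.at(pos) != "}") {
        if (tokens.at(pos) == "{")
            parse_scope(tokens, pos);   // recursion handles arbitrary nesting
        else
            ++pos;                      // a real parser would dispatch on the token here
    }
    ++pos;                              // consume the closing '}'
}
```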
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
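As a hand-rolled sketch of the "FSM plus a stack" idea (not what an LALR(1) generator actually emits, just the principle): push the current state when a scope opens and pop it when the scope closes, so the machine always knows which state to return to:

```cpp
#include <stack>
#include <string>

enum class State { Upper, Normal };   // illustrative state names only

// Sketch: the stack remembers which state to return to when a scope closes,
// which is what lets the machine handle arbitrarily deep nesting.
bool process(State& state, std::stack<State>& saved, const std::string& token)
{
    if (token == "{") {
        saved.push(state);
        state = State::Normal;
    } else if (token == "}") {
        if (saved.empty()) return false;   // unbalanced '}'
        state = saved.top();
        saved.pop();
    }
    return true;
}
```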
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analogous to words in spoken or written languages). Finite state machines are very good at lexical analysis, provided you've made good design decisions about the lexical structure. From your data above, individual lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokens (e.g. "{", "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc. (A tiny lexer sketch follows this list.)
Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
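Here is a deliberately tiny character-level sketch of that first stage; the character classes (identifier characters, symbols, whitespace) are assumptions about your format, not a complete design:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Minimal lexer sketch: groups consecutive identifier characters into one
// token and emits single-character tokens for symbols such as { } / .
std::vector<std::string> lex(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c)) || c == '_') {
            current += c;                              // still inside an identifier
        } else {
            if (!current.empty()) {                    // identifier just ended
                tokens.push_back(current);
                current.clear();
            }
            if (c == '\n')
                tokens.push_back("\n");                // keep line ends as their own token
            else if (!std::isspace(static_cast<unsigned char>(c)))
                tokens.push_back(std::string(1, c));   // symbol token
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```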
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state / transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, next_state and current_state are some integer types (possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of a token is identified when you are in a valid end state and there is no valid transition to another state for the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making things more difficult than you need.
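A minimal sketch of that table-driven approach, with invented states and character classes purely for illustration:

```cpp
// Illustrative table-driven step: states and character classes are small
// integers; -1 marks an invalid transition (an input error, or the end of a
// token if the current state is a valid end state).
enum { StStart, StIdent, NumStates };
enum { ChLetter, ChSpace, ChOther, NumClasses };

const int state_transitions[NumStates][NumClasses] = {
    //            letter    space    other
    /* start */ { StIdent,  StStart, -1 },
    /* ident */ { StIdent,  -1,      -1 },   // space after an identifier ends the token
};

int step(int current_state, int input_class)
{
    return state_transitions[current_state][input_class];   // next state, or -1
}
```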
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the difference between Unix and Windows line breaks, however; you could create a FSM that operates on either).
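Regarding the stringstream detail from your question: whitespace skipping can be toggled on the same stream with the standard skipws / noskipws manipulators, for example:

```cpp
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("SOURCES bitwise.c framing.c \n HEADERS");
    std::string token;

    in >> token;                  // "SOURCES" (whitespace, including \n, skipped)

    in >> std::noskipws;          // stop skipping whitespace on this same stream
    char c;
    while (in >> c && c != '\n')  // now every character, including spaces, is seen
        std::cout << c;
    std::cout << "<newline reached>\n";

    in >> std::skipws;            // back to normal token extraction
    in >> token;                  // "HEADERS"
}
```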
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my comments on lexical and grammatical analysis above.