I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult
kind of project :)). I have a tokenized string with newlines signifying the end of a, euh... line. See here for an input example. All the comments and junk are filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:

1. Should I use an enum or an abstract class + derivatives for my states? The first is probably better for a small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity (rough sketch below). enum example and class example. EDIT: what about this suggestion for goto? I thought they were evil in C++?
2. When reading a list, I need to NOT ignore \n. My preferred way of using the string, via stringstream, will ignore \n by default. So I need a simple way of telling (the same!) stringstream to not ignore newlines when a certain state is enabled.
3. Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?

Here's the draft states I have in mind:

upper: reads global, exe, lib + target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.

Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion everywhere (even in the list state, per item).
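To make question 1 a bit more concrete, something like this is the kind of enum-based skeleton I'm picturing (the state names match my draft above; the transition logic is only a placeholder, not a real implementation):

```cpp
#include <sstream>
#include <string>

// Rough sketch only: states mirror my draft (upper/normal/list),
// the transitions here are just placeholders.
enum class State { Upper, Normal, List };

void parse(const std::string& input)
{
    std::istringstream in(input);
    State state = State::Upper;
    std::string token;
    while (in >> token) {
        switch (state) {
        case State::Upper:
            if (token == "{") state = State::Normal;   // entered a scope
            break;
        case State::Normal:
            if (token == "}") state = State::Upper;    // left the scope
            else if (token == "SOURCES" || token == "HEADERS")
                state = State::List;                   // start collecting a list
            break;
        case State::List:
            // ...add token to the current list; a newline should end the list,
            // which is exactly the \n problem from question 2.
            break;
        }
    }
}
```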
Thanks for any input.
A parser is a specialized type of state machine that analyzes the components and meaning of structured texts. Generally a parser is accompanied by its own high-level description language that describes the states and transitions used by the implied state machine.
A finite state machine is a machine that can, at any point in time, be in a specific state from a finite set of possible states. It can move (transition) to another state by accepting an input. If the machine allows for outputs, it can produce an output.
A system where particular inputs cause particular changes in state can be represented as a finite state machine. The classic example is a turnstile: inserting a coin unlocks it, and after the turnstile has been pushed, it locks again.
If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
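To illustrate the recursive-function (recursive descent) approach: each call below consumes one balanced {...} block and recurses when it meets a nested one. This is only a sketch, and it assumes the input has already been split into a vector of tokens:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative recursive-descent sketch: 'tokens' is the already-tokenized
// input, 'pos' is the current read position.
void parse_scope(const std::vector<std::string>& tokens, std::size_t& pos)
{
    if (tokens.at(pos) != "{")
        throw std::runtime_error("expected '{'");
    ++pos;
    while (tokens.at(pos) != "}") {
        if (tokens.at(pos) == "{")
            parse_scope(tokens, pos);   // recursion handles arbitrary nesting
        else
            ++pos;                      // a real parser would dispatch on the token here
    }
    ++pos;                              // consume the closing '}'
}
```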
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
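As a hand-rolled sketch of the "FSM plus a stack" idea (not what an LALR(1) generator actually emits, just the principle): push the current state when a scope opens and pop it when the scope closes, so the machine always knows which state to return to:

```cpp
#include <stack>
#include <string>

enum class State { Upper, Normal };   // illustrative state names only

// Sketch: the stack remembers which state to return to when a scope closes,
// which is what lets the machine handle arbitrarily deep nesting.
bool process(State& state, std::stack<State>& saved, const std::string& token)
{
    if (token == "{") {
        saved.push(state);
        state = State::Normal;
    } else if (token == "}") {
        if (saved.empty()) return false;   // unbalanced '}'
        state = saved.top();
        saved.pop();
    }
    return true;
}
```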
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analogous to words in spoken or written languages). Finite state machines are very good at lexical analysis, provided you've made good design decisions about the lexical structure. From your data above, individual lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokens (e.g. "{", "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc. (A tiny lexer sketch follows this list.)
Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
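Here is a deliberately tiny character-level sketch of that first stage; the character classes (identifier characters, symbols, whitespace) are assumptions about your format, not a complete design:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Minimal lexer sketch: groups consecutive identifier characters into one
// token and emits single-character tokens for symbols such as { } / .
std::vector<std::string> lex(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c)) || c == '_') {
            current += c;                              // still inside an identifier
        } else {
            if (!current.empty()) {                    // identifier just ended
                tokens.push_back(current);
                current.clear();
            }
            if (c == '\n')
                tokens.push_back("\n");                // keep line ends as their own token
            else if (!std::isspace(static_cast<unsigned char>(c)))
                tokens.push_back(std::string(1, c));   // symbol token
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```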
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state / transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, next_state and current_state are some integer types (possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of a token is identified when you are in a valid end state and there is no valid transition to another state for the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making things more difficult than you need.
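A minimal sketch of that table-driven approach, with invented states and character classes purely for illustration:

```cpp
// Illustrative table-driven step: states and character classes are small
// integers; -1 marks an invalid transition (an input error, or the end of a
// token if the current state is a valid end state).
enum { StStart, StIdent, NumStates };
enum { ChLetter, ChSpace, ChOther, NumClasses };

const int state_transitions[NumStates][NumClasses] = {
    //            letter    space    other
    /* start */ { StIdent,  StStart, -1 },
    /* ident */ { StIdent,  -1,      -1 },   // space after an identifier ends the token
};

int step(int current_state, int input_class)
{
    return state_transitions[current_state][input_class];   // next state, or -1
}
```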
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the difference between Unix and Windows line breaks, however; you could create a FSM that operates on either).
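Regarding the stringstream detail from your question: whitespace skipping can be toggled on the same stream with the standard skipws / noskipws manipulators, for example:

```cpp
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("SOURCES bitwise.c framing.c \n HEADERS");
    std::string token;

    in >> token;                  // "SOURCES" (whitespace, including \n, skipped)

    in >> std::noskipws;          // stop skipping whitespace on this same stream
    char c;
    while (in >> c && c != '\n')  // now every character, including spaces, is seen
        std::cout << c;
    std::cout << "<newline reached>\n";

    in >> std::skipws;            // back to normal token extraction
    in >> token;                  // "HEADERS"
}
```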
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my comments on lexical and grammatical analysis above.