I'm writing a small interpreter for a simple BASIC-like language as an exercise on an AVR microcontroller, in C, using the avr-gcc toolchain. If I were writing this to run on my Linux box, I could use flex/bison. Now that I've restricted myself to an 8-bit platform, how would I code the parser?


Is there an alternative for flex/bison that is usable on 8-bit embedded systems?

2 Answers

If you want an easy way to code parsers, or you are tight on space, you should hand-code a recursive descent parser; these are essentially LL(1) parsers. This is especially effective for languages that are as "simple" as BASIC. (I did several of these back in the 70s!) The good news is that these don't contain any library code; just what you write.

They are pretty easy to code, if you already have a grammar. First, you have to get rid of left recursive rules (e.g., X = X Y ). This is generally pretty easy to do, so I leave it as an exercise. (You don't have to do this for list-forming rules; see discussion below).

Then if you have BNF rule of the form:

 X = A B C ;

create a subroutine for each item in the rule (X, A, B, C) that returns a boolean saying "I saw the corresponding syntax construct". For X, code:

subroutine X()
     if ~(A()) return false;
     if ~(B()) { error(); return false; }
     if ~(C()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end X;

Similarly for A, B, C.
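
In C, the X = A B C rule above might look like the following minimal sketch. The single-character terminals 'a', 'b' and 'c', the global `cursor`, the `error_count` counter and the `parse_X` wrapper are all assumptions made for illustration; they are not part of the original answer.

```c
#include <stdbool.h>

/* Hypothetical parser state: a cursor into the line being parsed. */
static const char *cursor;
static int error_count;

static void error(void) { error_count++; }

/* Match one literal character; the cursor moves only on success. */
static bool match_char(char c)
{
    if (*cursor != c)
        return false;
    cursor++;
    return true;
}

/* Stand-in rules; a real grammar would have richer ones here. */
static bool A(void) { return match_char('a'); }
static bool B(void) { return match_char('b'); }
static bool C(void) { return match_char('c'); }

/* X = A B C ; */
static bool X(void)
{
    if (!A()) return false;               /* not an X at all: no error */
    if (!B()) { error(); return false; }  /* committed to X after A */
    if (!C()) { error(); return false; }
    /* semantic action would go here */
    return true;
}

/* Parse one line; true if the line starts with a valid X. */
static bool parse_X(const char *line)
{
    cursor = line;
    error_count = 0;
    return X();
}
```

Note the asymmetry: failing on the first item just means "this is not an X", while failing later is a syntax error because the parser has already committed to the rule.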

If a token is a terminal, write code that checks the input stream for the string of characters that makes up the terminal. E.g., for a Number, check that the input stream contains digits and advance the input stream cursor past the digits. This is especially easy if you are parsing out of a buffer (for BASIC, you tend to get one line at a time): you simply advance, or don't advance, a buffer scan pointer. This code is essentially the lexer part of the parser.
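
As a rough sketch of such terminal checks in C, assuming a global scan pointer named `scan` and two hypothetical helpers `number` and `keyword` (none of these names come from the original answer):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical buffer scan pointer into the current input line. */
static const char *scan;

static const char *skip_blanks(const char *p)
{
    while (*p == ' ')
        p++;
    return p;
}

/* Number = digit { digit } ;  On failure, scan is left untouched.
 * Overflow is not checked in this sketch. */
static bool number(uint16_t *out)
{
    const char *p = skip_blanks(scan);
    uint16_t value = 0;

    if (*p < '0' || *p > '9')
        return false;
    while (*p >= '0' && *p <= '9') {
        value = (uint16_t)(value * 10u + (uint16_t)(*p - '0'));
        p++;
    }
    scan = p;        /* commit: advance past the digits */
    *out = value;
    return true;
}

/* Match a literal keyword such as "GOTO"; scan moves only on success. */
static bool keyword(const char *word)
{
    const char *p = skip_blanks(scan);

    while (*word != '\0') {
        if (*p != *word)
            return false;
        p++;
        word++;
    }
    scan = p;
    return true;
}
```

Working on a local copy of the pointer and committing only on success is what lets a failed terminal leave the input exactly where it was.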

If your BNF rule is recursive... don't worry. Just code the recursive call. This handles grammar rules like:

T  =  '('  T  ')' ;

This can be coded as:

subroutine T()
     if ~(left_paren()) return false;
     if ~(T()) { error(); return false; }
     if ~(right_paren()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end T;
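
A C sketch of the recursive rule follows. To make it runnable, it assumes an extra base case, T = '(' T ')' | 'x', which is not in the rule as written above. The `MAX_DEPTH` guard is also my addition: on a chip with very little RAM every nested call costs stack, so bounding the recursion may be worth considering.

```c
#include <stdbool.h>

#define MAX_DEPTH 8   /* assumed limit; tune to the stack you can spare */

static const char *cursor;
static int errors;
static unsigned char depth;

static void error(void) { errors++; }

static bool match_char(char c)
{
    if (*cursor != c)
        return false;
    cursor++;
    return true;
}

/* T = '(' T ')' | 'x' ;   the 'x' alternative is an assumed base case */
static bool T(void)
{
    bool ok;

    if (match_char('x'))
        return true;
    if (!match_char('('))
        return false;
    if (depth >= MAX_DEPTH) { error(); return false; }  /* too deep */
    depth++;
    ok = T();            /* the recursive rule is just a recursive call */
    depth--;
    if (!ok) { error(); return false; }
    if (!match_char(')')) { error(); return false; }
    return true;
}

/* Parse a whole line as a single T. */
static bool parse_T(const char *line)
{
    cursor = line;
    errors = 0;
    depth = 0;
    return T() && *cursor == '\0';
}
```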

If you have a BNF rule with an alternative:

 P = Q | R ;

then code P with alternative choices:

subroutine P()
    if ~(Q())
        { if ~(R()) return false;
          return true;
        }
    return true;
end P;
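
In C the alternative collapses to a couple of lines. This sketch assumes Q and R are the keywords PRINT and GOTO, chosen only for illustration, and relies on each alternative leaving the cursor untouched when it fails:

```c
#include <stdbool.h>
#include <string.h>

static const char *cursor;

/* Match a literal word; the cursor moves only on success, so a failed
 * alternative leaves the input where the next alternative expects it. */
static bool match_word(const char *word)
{
    size_t n = strlen(word);

    if (strncmp(cursor, word, n) != 0)
        return false;
    cursor += n;
    return true;
}

/* Hypothetical alternatives. */
static bool Q(void) { return match_word("PRINT"); }
static bool R(void) { return match_word("GOTO"); }

/* P = Q | R ; */
static bool P(void)
{
    if (Q())
        return true;
    return R();
}
```

If an alternative could consume input before failing, you would need to save and restore the cursor around it, or refactor the grammar so the choice can be made on the first token.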

Sometimes you'll encounter list forming rules. These tend to be left recursive, and this case is easily handled. The basic idea is to use iteration rather than recursion, and that avoids the infinite recursion you would get doing this the "obvious" way. Example:

L  =  A |  L A ;

You can code this using iteration as:

subroutine L()
    if ~(A()) then return false;
    while (A()) do { /* loop */ }
    return true;
end L;
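
The same loop in C, as a small sketch. Here A is assumed to be the single character 'a', and `item_count` stands in for whatever semantic action you would run per list element; both are illustrative choices.

```c
#include <stdbool.h>

static const char *cursor;
static unsigned char item_count;   /* stand-in semantic action */

/* Hypothetical list element: the single character 'a'. */
static bool A(void)
{
    if (*cursor != 'a')
        return false;
    cursor++;
    item_count++;
    return true;
}

/* L = A | L A ;   coded as "one A, then as many more as are present" */
static bool L(void)
{
    if (!A())
        return false;
    while (A()) {
        /* each extra element was already handled inside A() */
    }
    return true;
}
```

Because the loop uses no extra stack per element, list length is not limited by RAM the way recursion depth is.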

You can code several hundred grammar rules in a day or two this way. There are more details to fill in, but the basics here should be more than enough.

If you are really tight on space, you can build a virtual machine that implements these ideas. That's what I did back in the 70s, when 8K 16-bit words was what you could get.

If you don't want to code this by hand, you can automate it with a metacompiler (Meta II) that produces essentially the same thing. These are mind-blowing technical fun and really take all the work out of doing this, even for big grammars.

August 2014:

I get a lot of requests for "how to build an AST with a parser". For details on this, which essentially elaborates this answer, see my other SO answer https://stackoverflow.com/a/25106688/120163

July 2015:

There are lots of folks who want to write a simple expression evaluator. You can do this by doing the same kinds of things that the "AST builder" link above suggests; just do arithmetic instead of building tree nodes. Here's an expression evaluator done this way.
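
To give a flavor of "do arithmetic instead of building tree nodes", here is a minimal sketch of my own (it is not the evaluator linked above). It assumes the grammar expr = term { ('+' | '-') term }, term = factor { ('*' | '/') factor }, factor = number | '(' expr ')', unsigned integer literals only, and no overflow checking.

```c
#include <stdbool.h>
#include <ctype.h>

static const char *cursor;

static void skip_blanks(void) { while (*cursor == ' ') cursor++; }

static bool match_char(char c)
{
    skip_blanks();
    if (*cursor != c)
        return false;
    cursor++;
    return true;
}

static bool expr(long *out);

/* factor = number | '(' expr ')' ; */
static bool factor(long *out)
{
    skip_blanks();
    if (isdigit((unsigned char)*cursor)) {
        long v = 0;
        while (isdigit((unsigned char)*cursor))
            v = v * 10 + (*cursor++ - '0');
        *out = v;
        return true;
    }
    if (!match_char('('))
        return false;
    if (!expr(out))
        return false;
    return match_char(')');
}

/* term = factor { ('*' | '/') factor } ; */
static bool term(long *out)
{
    long rhs;

    if (!factor(out))
        return false;
    for (;;) {
        if (match_char('*')) {
            if (!factor(&rhs)) return false;
            *out *= rhs;
        } else if (match_char('/')) {
            if (!factor(&rhs) || rhs == 0) return false;  /* no divide by 0 */
            *out /= rhs;
        } else {
            return true;
        }
    }
}

/* expr = term { ('+' | '-') term } ; */
static bool expr(long *out)
{
    long rhs;

    if (!term(out))
        return false;
    for (;;) {
        if (match_char('+')) {
            if (!term(&rhs)) return false;
            *out += rhs;
        } else if (match_char('-')) {
            if (!term(&rhs)) return false;
            *out -= rhs;
        } else {
            return true;
        }
    }
}

/* Evaluate a whole line; false on any syntax error or trailing junk. */
static bool evaluate(const char *line, long *out)
{
    cursor = line;
    if (!expr(out))
        return false;
    skip_blanks();
    return *cursor == '\0';
}
```

Operator precedence falls out of the rule nesting: `term` binds tighter than `expr` simply because `expr` calls it.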

October 2021:

It's worth noting that this kind of parser works when your language doesn't have complications that recursive descent doesn't handle well. I offer two kinds of complications: a) genuinely ambiguous parses (e.g., more than one way to parse a phrase) and b) arbitrarily long lookahead (e.g., not bounded by a constant). In these cases recursive descent turns into recursive descent into hell, and it's time to get a parser generator that can handle them. See my bio for a system that uses GLR parser generators to handle over 50 different languages, including all these complications even to the point of ridiculousness.

answered Sep 18 '22 17:09

Ira Baxter

I've implemented a parser for a simple command language targeted for the ATmega328p. This chip has 32k ROM and only 2k RAM. The RAM is definitely the more important limitation -- if you aren't tied to a particular chip yet, pick one with as much RAM as possible. This will make your life much easier.

At first I considered using flex/bison. I decided against this option for two major reasons:

By default, Flex & Bison depend on some standard library functions (especially for I/O) that aren't available or don't work the same in avr-libc. I'm pretty sure there are supported workarounds, but this is some extra effort that you will need to take into account.
AVR has a Harvard Architecture. C isn't designed to account for this, so even constant variables are loaded into RAM by default. You have to use special macros/functions to store and access data in flash and EEPROM. Flex & Bison create some relatively large lookup tables, and these will eat up your RAM pretty quickly. Unless I'm mistaken (which is quite possible) you will have to edit the output source in order to take advantage of the special Flash & EEPROM interfaces.
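
For reference, this is roughly what the "special macros/functions" look like when you control the source yourself. The sketch keeps a small keyword table in flash using avr-libc's `PROGMEM` attribute and compares against it with `strcmp_P`. The table contents and the `keyword_index` helper are hypothetical, and the `#else` branch is only a host-side fallback I added so the same code builds off-target.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>
#else
/* Host fallback (assumption): no separate flash address space here. */
#include <string.h>
#define PROGMEM
#define strcmp_P(ram, flash) strcmp((ram), (flash))
#endif

/* Hypothetical keyword table kept in flash rather than RAM.
 * Fixed-width rows avoid a separate pointer table. */
static const char keywords[][6] PROGMEM = { "PRINT", "GOTO", "LET", "IF" };

#define KEYWORD_COUNT (sizeof keywords / sizeof keywords[0])

/* Return the index of word in the flash table, or -1 if it is absent.
 * word lives in RAM; strcmp_P reads its second argument from flash. */
static int8_t keyword_index(const char *word)
{
    uint8_t i;

    for (i = 0; i < KEYWORD_COUNT; i++) {
        if (strcmp_P(word, keywords[i]) == 0)
            return (int8_t)i;
    }
    return -1;
}
```

Generated tables from Flex & Bison would need the same treatment at every point where the generated code reads them, which is why editing the output becomes necessary.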

After rejecting Flex & Bison, I went looking for other generator tools. Here are a few that I considered:

LEMON
Ragel
re2c

You might also want to take a look at Wikipedia's comparison.

Ultimately, I ended up hand coding both the lexer and parser.

For parsing I used a recursive descent parser. I think Ira Baxter has already done an adequate job of covering this topic, and there are plenty of tutorials online.

For my lexer, I wrote up regular expressions for all of my terminals, diagrammed the equivalent state machine, and implemented it as one giant function using goto's for jumping between states. This was tedious, but the results worked great. As an aside, goto is a great tool for implementing state machines -- all of your states can have clear labels right next to the relevant code, there is no function call or state variable overhead, and it's about as fast as you can get. C really doesn't have a better construct for building static state machines.
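
To illustrate the style (this is not my actual lexer), here is a minimal sketch with just two token kinds, numbers and identifiers. The `enum token` names and the `next_token` signature are made up for the example.

```c
#include <ctype.h>

enum token { TOK_END, TOK_NUMBER, TOK_IDENT, TOK_ERROR };

/* Scan one token starting at *pp and advance *pp past it.
 * Each label is one state of the hand-drawn state machine. */
static enum token next_token(const char **pp)
{
    const char *p = *pp;
    enum token result;

start:
    if (*p == ' ')  { p++; goto start; }
    if (*p == '\0') { result = TOK_END; goto done; }
    if (isdigit((unsigned char)*p)) { p++; goto in_number; }
    if (isalpha((unsigned char)*p)) { p++; goto in_ident; }
    p++;                       /* skip the offending character */
    result = TOK_ERROR;
    goto done;

in_number:
    if (isdigit((unsigned char)*p)) { p++; goto in_number; }
    result = TOK_NUMBER;
    goto done;

in_ident:
    if (isalnum((unsigned char)*p)) { p++; goto in_ident; }
    result = TOK_IDENT;
    goto done;

done:
    *pp = p;
    return result;
}
```

The current state is simply the program counter, so there is no state variable to store or dispatch on.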

Something to think about: lexers are really just a specialization of parsers. The biggest difference is that regular grammars are usually sufficient for lexical analysis, whereas most programming languages have (mostly) context-free grammars. So there's really nothing stopping you from implementing a lexer as a recursive descent parser or using a parser generator to write a lexer. It's just not usually as convenient as using a more specialized tool.

answered Sep 19 '22 17:09

Steve S


Tags:

parsing

embedded

avr-gcc

bison

flex-lexer

Johan
