I am reading this Bison introduction.
I have two questions, and it would be great if someone could help me understand them:
What does the term context-free grammar mean?
From the link above: "Not all context-free languages can be handled by Bison, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead." What does it mean by "possible to tell how to parse any portion of an input string with just a single token of look-ahead"?
In order for Bison to parse a language, it must be described by a context-free grammar. This means that you specify one or more syntactic groupings and give rules for constructing them from their parts.
LALR stands for Look-Ahead LR. To construct the LALR(1) parsing table, we start from the canonical collection of LR(1) items. In LALR(1) parsing, LR(1) items that have the same productions but different lookaheads are merged into a single set of items.
Bison is a general-purpose parser generator that converts an annotated context-free grammar into a deterministic LR or generalized LR (GLR) parser employing LALR(1) parser tables. As an experimental feature, Bison can also generate IELR(1) or canonical LR(1) parser tables.
A context-free grammar is a description of a set of strings that is strictly more powerful than regular expressions, but still easily handled by a machine. More formally, a context-free grammar consists of four things:
A set of terminal symbols, which are the elements of the strings being produced. For a bison parser, this is typically the set of the tokens produced by the scanner. For a grammar for a natural language like English, this might be the set of all English words.
A set of nonterminal symbols. Intuitively, a nonterminal symbol represents something like a part of speech, such as "noun" or "verb."
A set of productions. Each production says how to replace a nonterminal symbol with some sequence of terminals and nonterminals. For example, the production Variable -> Type Name says that if we see the nonterminal Variable, we may replace it with the string Type Name.
A start symbol, which is the nonterminal from which the derivation begins.
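The four components above can be written down directly as data. Here is a minimal sketch in Python of the function-declaration grammar that follows, using plain tuples and sets; the names are illustrative, not part of any parser-generator API:

```python
# Terminal symbols: the tokens a scanner would produce.
terminals = {"int", "*", "ident", "(", ")", ","}

# Nonterminal symbols: the "parts of speech" of the grammar.
nonterminals = {"Function", "Type", "Arguments", "ArgList"}

# Productions: each pairs a nonterminal with one possible replacement,
# a sequence of terminals and nonterminals.
productions = [
    ("Function",  ["Type", "ident", "(", "Arguments", ")"]),
    ("Type",      ["int"]),
    ("Type",      ["Type", "*"]),
    ("Arguments", []),                        # the empty string
    ("Arguments", ["ArgList"]),
    ("ArgList",   ["Type", "ident"]),
    ("ArgList",   ["Type", "ident", ",", "ArgList"]),
]

# The start symbol, from which every derivation begins.
start_symbol = "Function"
```

A Bison grammar file encodes exactly this structure, just in its own notation.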
As an example, consider this simple context-free grammar for C-style function declarations:
Function -> Type ident(Arguments)
Type -> int
Type -> Type *
Arguments -> e (the empty string)
Arguments -> ArgList
ArgList -> Type ident
ArgList -> Type ident, ArgList
Here, the start symbol is Function. Given this grammar, we can produce C-style function declarations by repeatedly picking a nonterminal symbol and replacing it with one of the right-hand sides of a matching production. At each step, the string we've built so far is called a sentential form. For example, here are some different function declarations that can be derived from the above grammar:
Sentential Form Production
-------------------------------------------------------------------
Function Function -> Type ident(Arguments)
Type ident(Arguments) Type -> int
int ident(Arguments) Arguments -> e
int ident()
Sentential Form Production
-------------------------------------------------------------------
Function Function -> Type ident(Arguments)
Type ident(Arguments) Type -> Type*
Type* ident(Arguments) Type -> int
int* ident(Arguments) Arguments -> ArgList
int* ident(ArgList) ArgList -> Type ident, ArgList
int* ident(Type ident, ArgList) ArgList -> Type ident
int* ident(Type ident, Type ident) Type -> Type*
int* ident(Type* ident, Type ident) Type -> Type*
int* ident(Type** ident, Type ident) Type -> int
int* ident(int** ident, Type ident) Type -> int
int* ident(int** ident, int ident)
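The "pick a nonterminal, replace it" step is mechanical enough to sketch in a few lines of Python. This hypothetical snippet replays the first derivation table above, printing each sentential form (production numbers are my own labels, not from the original text):

```python
# A small subset of the grammar, keyed by an arbitrary production number.
productions = {
    1: ("Function",  ["Type", "ident", "(", "Arguments", ")"]),
    2: ("Type",      ["int"]),
    4: ("Arguments", []),                 # Arguments -> e (the empty string)
}

def apply(form, prod):
    """Replace the leftmost occurrence of prod's nonterminal in form."""
    lhs, rhs = prod
    i = form.index(lhs)                   # leftmost occurrence
    return form[:i] + rhs + form[i + 1:]

form = ["Function"]
for step in (1, 2, 4):                    # mirrors the first table above
    form = apply(form, productions[step])
    print(" ".join(form))
# last line printed: int ident ( )
```

Each printed line is one sentential form; the final one contains only terminals, so it is a string of the language.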
Most programming languages have a structure that can be described by a context-free grammar. The job of the parser is to take a program and a grammar and to determine how that program can be generated by the grammar.
As for LALR(1), unfortunately the formal definition of LALR(1) is not trivial. I just finished teaching a compiler construction course and we were only able to talk about LALR(1) after first spending two lectures discussing related parsing techniques. If you'd like a formal introduction to the material, my slides on bottom-up parsing are available at the course website.
LALR(1) is a type of parsing algorithm called a bottom-up parsing algorithm, which means that it tries to apply the productions of the grammar in reverse order to reduce the program back to the start symbol. For example, let's consider this string, which is generated by the above grammar:
int** ident(int* ident)
In a bottom-up parse, we would parse this string by looking at the program one token at a time. Whenever we find something that can be reduced back to some nonterminal, we do so. (To be more precise, LALR(1) only performs these sorts of reductions when certain other criteria are met as well, so that the algorithm has more context, but for this example we don't need to worry about that.) Each step in the parse is said to be either a shift or a reduce. A shift means that we look at one more token of the input to gain more information about what reductions to apply. A reduce means that we take some number of the tokens and nonterminals and reverse a production to get back to some nonterminal.
Here's a trace of the bottom-up parse of the string:
Workspace Input Action
-----------------------------------------------------------------
int** ident(int* ident) Shift
int ** ident(int* ident) Reduce Type -> int
Type ** ident(int* ident) Shift
Type* * ident(int* ident) Reduce Type -> Type*
Type * ident(int* ident) Shift
Type* ident(int* ident) Reduce Type -> Type*
Type ident(int* ident) Shift
Type ident (int* ident) Shift
Type ident( int* ident) Shift
Type ident(int * ident) Reduce Type -> int
Type ident(Type * ident) Shift
Type ident(Type* ident) Reduce Type -> Type*
Type ident(Type ident) Shift
Type ident(Type ident ) Reduce ArgList -> Type ident
Type ident(ArgList ) Reduce Arguments -> ArgList
Type ident(Arguments ) Shift
Type ident(Arguments) Reduce Function -> Type ident(Arguments)
Function ACCEPT
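To make the trace above concrete, here is a hand-coded sketch of that shift-reduce parse in Python. Be aware of the big caveat: a real LALR(1) parser derives its shift/reduce decisions from generated tables, whereas here the lookahead conditions are written out by hand just to reproduce this one trace (they are exactly the "other criteria" hedged about earlier):

```python
def parse(tokens):
    stack, i, trace = [], 0, []
    tokens = tokens + ["$"]                     # end-of-input marker
    while True:
        la = tokens[i]                          # one token of lookahead
        if stack[-1:] == ["int"]:
            stack[-1:] = ["Type"]; trace.append("Type -> int")
        elif stack[-2:] == ["Type", "*"]:
            stack[-2:] = ["Type"]; trace.append("Type -> Type*")
        # Reduce ArgList -> Type ident only when the lookahead says the
        # argument list is ending; this is where the single token of
        # lookahead earns its keep.
        elif stack[-2:] == ["Type", "ident"] and la in (")", ","):
            stack[-2:] = ["ArgList"]; trace.append("ArgList -> Type ident")
        elif stack[-1:] == ["ArgList"] and la == ")":
            stack[-1:] = ["Arguments"]; trace.append("Arguments -> ArgList")
        elif stack[-5:] == ["Type", "ident", "(", "Arguments", ")"]:
            stack[-5:] = ["Function"]
            trace.append("Function -> Type ident(Arguments)")
        elif la != "$":
            stack.append(la); i += 1            # shift
        else:
            break
    return stack, trace

stack, trace = parse(["int", "*", "*", "ident", "(", "int", "*", "ident", ")"])
print(stack)    # ['Function'] means the input was accepted
```

Note how "Type ident" before the "(" is not reduced to ArgList, because the lookahead there is "(" rather than ")" or ",": that single-token peek is what lets the parser tell the two situations apart.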
It's important to know about shifts and reductions because you will invariably encounter shift/reduce conflicts and reduce/reduce conflicts when using Bison. These errors mean that the parser generator determined that the parser might get into a state where it can't tell whether to shift or reduce, or which of two reductions it's supposed to perform. Consult the Bison manual for more details about how to deal with this.
If you'd like to learn more about context-free grammars and parsing algorithms in general, there is an excellent book entitled Parsing Techniques: A Practical Guide, Second Edition by Grune and Jacobs that has, by far, the best treatment of the material I've ever seen. It covers all sorts of parsing algorithms, including many techniques that are substantially more powerful than LALR(1) that are starting to get wider usage (GLR and Earley, for example). I highly recommend this book - it's the main reason I understand parsing so well!
Hope this helps!