Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Learning bison: What are context-free grammars and LALR(1)?

I am reading this bison introduction.

I have two questions and it will be great if someone can help me understand:

  1. What does term context free grammar mean?

  2. From the link above: Not all context-free languages can be handled by Bison, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead. What does it mean by "possible to tell how to parse any portion of an input string with just a single token of look-ahead`"

like image 660
hari Avatar asked Aug 24 '11 16:08

hari


People also ask

Is Bison context-free grammar?

In order for Bison to parse a language, it must be described by a context-free grammar. This means that you specify one or more syntactic groupings and give rules for constructing them from their parts.

What is an LALR 1 grammar?

LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the canonical collection of LR (1) items. In the LALR (1) parsing, the LR (1) items which have same productions but different look ahead are combined to form a single set of items.

What is Bison in programming?

Bison is a general-purpose parser generator that converts an annotated context-free grammar into a deterministic LR or generalized LR (GLR) parser employing LALR (1) parser tables. As an experimental feature, Bison can also generate IELR (1) or canonical LR(1) parser tables.

How do you know if a grammar is LALR 1?

3 Answers. For LL(1) You have to check by using First and Follow method. For LR(0) and SLR(1) you have to do augmented transition method, and then by making state transition diagram, you have to look where Shift-Reduce and Reduce Reduce conflicts are present and according to that you've to eliminate the parser.


1 Answers

A context-free grammar is a description of a set of strings that is strictly more powerful than the regular expressions, but still easily handled by a machine. More formally, a context-free grammar consists of four things:

  • A set of terminal symbols, which are the elements of the strings being produced. For a bison parser, this is typically the set of the tokens produced by the scanner. For a grammar for a natural language like English, this might be the set of all English words.

  • A set of nonterminal symbols. Intuitively, a nonterminal symbol represents something like a part of speech, such as "noun" or "verb."

  • A set of productions. Each production says how to replace a nonterminal symbol with some other set of terminals and nonterminals. For example, the production Variable -> Type Name says that if we see the nonterminal Variable, we may replace it with the string Type Name.

  • A start symbol, which is the nonterminal from which the derivation begins.

As an example, consider this simple context-free grammar for C-style function declarations:

Function -> Type ident(Arguments)
Type -> int
Type -> Type *
Arguments -> e   (the empty string)
Arguments -> ArgList
ArgList -> Type ident
ArgList -> Type ident, ArgList

Here, the start symbol is Function. Given this grammar, we can produce C-style function declarations by repeatedly picking a nonterminal symbol and replacing it with one of the right-hand sides of a matching production. At each step, the string we've built so far is called a sentential form. For example, here are some different function declarations that can be derived from the above grammar:

Sentential Form                       Production
-------------------------------------------------------------------
Function                              Function -> Type ident(Arguments)
Type ident(Arguments)                 Type -> int
int ident(Arguments)                  Arguments -> e
int ident()

Sentential Form                       Production
-------------------------------------------------------------------
Function                              Function -> Type ident(Arguments)
Type ident(Arguments)                 Type -> Type*
Type* ident(Arguments)                Type -> int
int* ident(Arguments)                 Arguments -> ArgList
int* ident(ArgList)                   ArgList -> Type ident, ArgList
int* ident(Type ident, ArgList)       ArgList -> Type ident
int* ident(Type ident, Type ident)    Type -> Type*
int* ident(Type* ident, Type ident)   Type -> Type*
int* ident(Type** ident, Type ident)  Type -> int
int* ident(int** ident, Type ident)   Type -> int
int* ident(int** ident, int ident)

Most programming languages have a structure that can be described by a context-free grammar. The job of the parser is to take a program and a grammar and to determine how that program can be generated by the grammar.

As for LALR(1), unfortunately the formal definition of LALR(1) is not trivial. I just finished teaching a compiler construction course and we were only able to talk about LALR(1) after first spending two lectures discussing related parsing techniques. If you'd like a formal introduction to the material, my slides on bottom-up parsing are available at the course website.

LALR(1) is a type of parsing algorithm called a bottom-up parsing algorithm, which means that it tries to apply the productions of the grammar in reverse order to reduce the program back to the start symbol. For example, let's consider this string, which is generated by the above grammar:

int** ident(int* ident)

In a bottom-up parse, we would parse this string by looking at the program one token at a time. Whenever we found something that could be reversed back to some nonterminal, we do so. (To be more precise, LALR(1) only does these sorts of reductions when other criteria are met as well so that the algorithm has more context, but for this example we don't need to worry about this). Each step in the parse is said either to be a shift or a reduce. A shift means that we look at one more token of the input to gain more information about what reductions to apply. A reduce means that we take some number of the tokens and nonterminals and reverse a production to get back to some nonterminal.

Here's a trace of the bottom-up parse of the string:

Workspace           Input                     Action
-----------------------------------------------------------------
             int** ident(int* ident)   Shift
int             ** ident(int* ident)   Reduce Type -> int
Type            ** ident(int* ident)   Shift
Type*            * ident(int* ident)   Reduce Type -> Type*
Type             * ident(int* ident)   Shift
Type*              ident(int* ident)   Reduce Type -> Type*
Type               ident(int* ident)   Shift
Type ident              (int* ident)   Shift
Type ident(              int* ident)   Shift
Type ident(int              * ident)   Reduce Type -> int
Type ident(Type             * ident)   Shift
Type ident(Type*              ident)   Reduce Type -> Type*
Type ident(Type               ident)   Shift
Type ident(Type ident              )   Reduce ArgList -> Type ident
Type ident(ArgList                 )   Reduce Arguments -> ArgList
Type ident(Arguments               )   Shift
Type ident(Arguments)                  Reduce Function -> Type ident(Arguments)
Function                               ACCEPT

It's important to know about shifts and reductions because you will invariable encounter shift/reduce conflicts and reduce/reduce conflicts when using bison. These errors mean that the parser generator determined that the parser might get into a state where it can't tell whether to shift or reduce, or which of two reductions it's supposed to perform. Consult the bison manual for more details about how to deal with this.

If you'd like to learn more about context-free grammars and parsing algorithms in general, there is an excellent book entitled Parsing Techniques: A Practical Guide, Second Edition by Grune and Jacobs that has, by far, the best treatment of the material I've ever seen. It covers all sorts of parsing algorithms, including many techniques that are substantially more powerful than LALR(1) that are starting to get wider usage (GLR and Earley, for example). I highly recommend this book - it's the main reason I understand parsing so well!

Hope this helps!

like image 181
templatetypedef Avatar answered Nov 12 '22 15:11

templatetypedef