I am reading this Bison introduction.
I have two questions, and it would be great if someone could help me understand them:
What does the term context-free grammar mean?
From the link above: "Not all context-free languages can be handled by Bison, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead." What does it mean by "possible to tell how to parse any portion of an input string with just a single token of look-ahead"?
In order for Bison to parse a language, it must be described by a context-free grammar. This means that you specify one or more syntactic groupings and give rules for constructing them from their parts.
LALR stands for Look-Ahead LR. To construct the LALR(1) parsing table, we start from the canonical collection of LR(1) items. In LALR(1) parsing, LR(1) items that have the same productions but different lookaheads are merged into a single set of items.
Bison is a general-purpose parser generator that converts an annotated context-free grammar into a deterministic LR or generalized LR (GLR) parser employing LALR(1) parser tables. As an experimental feature, Bison can also generate IELR(1) or canonical LR(1) parser tables.
A context-free grammar is a description of a set of strings that is strictly more powerful than regular expressions, but still easily handled by a machine. More formally, a context-free grammar consists of four things:
A set of terminal symbols, which are the elements of the strings being produced. For a bison parser, this is typically the set of the tokens produced by the scanner. For a grammar for a natural language like English, this might be the set of all English words.
A set of nonterminal symbols. Intuitively, a nonterminal symbol represents something like a part of speech, such as "noun" or "verb."
A set of productions. Each production says how to replace a nonterminal symbol with some sequence of terminals and nonterminals. For example, the production Variable -> Type Name says that if we see the nonterminal Variable, we may replace it with the string Type Name.
A start symbol, which is the nonterminal from which the derivation begins.
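The four components above can be written down directly as data. Here is a minimal sketch in Python of the function-declaration grammar that follows, using plain tuples and sets; the names are illustrative, not part of any parser-generator API:

```python
# Terminal symbols: the tokens a scanner would produce.
terminals = {"int", "*", "ident", "(", ")", ","}

# Nonterminal symbols: the "parts of speech" of the grammar.
nonterminals = {"Function", "Type", "Arguments", "ArgList"}

# Productions: each pairs a nonterminal with one possible replacement,
# a sequence of terminals and nonterminals.
productions = [
    ("Function",  ["Type", "ident", "(", "Arguments", ")"]),
    ("Type",      ["int"]),
    ("Type",      ["Type", "*"]),
    ("Arguments", []),                        # the empty string
    ("Arguments", ["ArgList"]),
    ("ArgList",   ["Type", "ident"]),
    ("ArgList",   ["Type", "ident", ",", "ArgList"]),
]

# The start symbol, from which every derivation begins.
start_symbol = "Function"
```

A Bison grammar file encodes exactly this structure, just in its own notation.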
As an example, consider this simple context-free grammar for C-style function declarations:
Function -> Type ident(Arguments)
Type -> int
Type -> Type *
Arguments -> e (the empty string)
Arguments -> ArgList
ArgList -> Type ident
ArgList -> Type ident, ArgList
Here, the start symbol is Function. Given this grammar, we can produce C-style function declarations by repeatedly picking a nonterminal symbol and replacing it with one of the right-hand sides of a matching production. At each step, the string we've built so far is called a sentential form. For example, here are some different function declarations that can be derived from the above grammar:
Sentential Form Production
-------------------------------------------------------------------
Function Function -> Type ident(Arguments)
Type ident(Arguments) Type -> int
int ident(Arguments) Arguments -> e
int ident()
Sentential Form Production
-------------------------------------------------------------------
Function Function -> Type ident(Arguments)
Type ident(Arguments) Type -> Type*
Type* ident(Arguments) Type -> int
int* ident(Arguments) Arguments -> ArgList
int* ident(ArgList) ArgList -> Type ident, ArgList
int* ident(Type ident, ArgList) ArgList -> Type ident
int* ident(Type ident, Type ident) Type -> Type*
int* ident(Type* ident, Type ident) Type -> Type*
int* ident(Type** ident, Type ident) Type -> int
int* ident(int** ident, Type ident) Type -> int
int* ident(int** ident, int ident)
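The "pick a nonterminal, replace it" step is mechanical enough to sketch in a few lines of Python. This hypothetical snippet replays the first derivation table above, printing each sentential form (production numbers are my own labels, not from the original text):

```python
# A small subset of the grammar, keyed by an arbitrary production number.
productions = {
    1: ("Function",  ["Type", "ident", "(", "Arguments", ")"]),
    2: ("Type",      ["int"]),
    4: ("Arguments", []),                 # Arguments -> e (the empty string)
}

def apply(form, prod):
    """Replace the leftmost occurrence of prod's nonterminal in form."""
    lhs, rhs = prod
    i = form.index(lhs)                   # leftmost occurrence
    return form[:i] + rhs + form[i + 1:]

form = ["Function"]
for step in (1, 2, 4):                    # mirrors the first table above
    form = apply(form, productions[step])
    print(" ".join(form))
# last line printed: int ident ( )
```

Each printed line is one sentential form; the final one contains only terminals, so it is a string of the language.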
Most programming languages have a structure that can be described by a context-free grammar. The job of the parser is to take a program and a grammar and to determine how that program can be generated by the grammar.
As for LALR(1), unfortunately the formal definition of LALR(1) is not trivial. I just finished teaching a compiler construction course and we were only able to talk about LALR(1) after first spending two lectures discussing related parsing techniques. If you'd like a formal introduction to the material, my slides on bottom-up parsing are available at the course website.
LALR(1) is a type of parsing algorithm called a bottom-up parsing algorithm, which means that it tries to apply the productions of the grammar in reverse order to reduce the program back to the start symbol. For example, let's consider this string, which is generated by the above grammar:
int** ident(int* ident)
In a bottom-up parse, we would parse this string by looking at the program one token at a time. Whenever we find something that can be reduced back to some nonterminal, we do so. (To be more precise, LALR(1) only performs these sorts of reductions when certain other criteria are met as well, so that the algorithm has more context, but for this example we don't need to worry about that.) Each step in the parse is said to be either a shift or a reduce. A shift means that we look at one more token of the input to gain more information about what reductions to apply. A reduce means that we take some number of the tokens and nonterminals and reverse a production to get back to some nonterminal.
Here's a trace of the bottom-up parse of the string:
Workspace Input Action
-----------------------------------------------------------------
int** ident(int* ident) Shift
int ** ident(int* ident) Reduce Type -> int
Type ** ident(int* ident) Shift
Type* * ident(int* ident) Reduce Type -> Type*
Type * ident(int* ident) Shift
Type* ident(int* ident) Reduce Type -> Type*
Type ident(int* ident) Shift
Type ident (int* ident) Shift
Type ident( int* ident) Shift
Type ident(int * ident) Reduce Type -> int
Type ident(Type * ident) Shift
Type ident(Type* ident) Reduce Type -> Type*
Type ident(Type ident) Shift
Type ident(Type ident ) Reduce ArgList -> Type ident
Type ident(ArgList ) Reduce Arguments -> ArgList
Type ident(Arguments ) Shift
Type ident(Arguments) Reduce Function -> Type ident(Arguments)
Function ACCEPT
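To make the trace above concrete, here is a hand-coded sketch of that shift-reduce parse in Python. Be aware of the big caveat: a real LALR(1) parser derives its shift/reduce decisions from generated tables, whereas here the lookahead conditions are written out by hand just to reproduce this one trace (they are exactly the "other criteria" hedged about earlier):

```python
def parse(tokens):
    stack, i, trace = [], 0, []
    tokens = tokens + ["$"]                     # end-of-input marker
    while True:
        la = tokens[i]                          # one token of lookahead
        if stack[-1:] == ["int"]:
            stack[-1:] = ["Type"]; trace.append("Type -> int")
        elif stack[-2:] == ["Type", "*"]:
            stack[-2:] = ["Type"]; trace.append("Type -> Type*")
        # Reduce ArgList -> Type ident only when the lookahead says the
        # argument list is ending; this is where the single token of
        # lookahead earns its keep.
        elif stack[-2:] == ["Type", "ident"] and la in (")", ","):
            stack[-2:] = ["ArgList"]; trace.append("ArgList -> Type ident")
        elif stack[-1:] == ["ArgList"] and la == ")":
            stack[-1:] = ["Arguments"]; trace.append("Arguments -> ArgList")
        elif stack[-5:] == ["Type", "ident", "(", "Arguments", ")"]:
            stack[-5:] = ["Function"]
            trace.append("Function -> Type ident(Arguments)")
        elif la != "$":
            stack.append(la); i += 1            # shift
        else:
            break
    return stack, trace

stack, trace = parse(["int", "*", "*", "ident", "(", "int", "*", "ident", ")"])
print(stack)    # ['Function'] means the input was accepted
```

Note how "Type ident" before the "(" is not reduced to ArgList, because the lookahead there is "(" rather than ")" or ",": that single-token peek is what lets the parser tell the two situations apart.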
It's important to know about shifts and reductions because you will invariably encounter shift/reduce conflicts and reduce/reduce conflicts when using Bison. These errors mean that the parser generator determined that the parser might get into a state where it can't tell whether to shift or reduce, or which of two reductions it's supposed to perform. Consult the Bison manual for more details about how to deal with this.
If you'd like to learn more about context-free grammars and parsing algorithms in general, there is an excellent book entitled Parsing Techniques: A Practical Guide, Second Edition by Grune and Jacobs that has, by far, the best treatment of the material I've ever seen. It covers all sorts of parsing algorithms, including many techniques that are substantially more powerful than LALR(1) that are starting to get wider usage (GLR and Earley, for example). I highly recommend this book - it's the main reason I understand parsing so well!
Hope this helps!