Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What type of parser is needed for this grammar?

I have a grammar that I do not know what type of parser I need in order to parse it other than I do not believe the grammar is LL(1). I am thinking I need a parser with backtracking or LL(*) of some sort. The grammar I have came up with (which may need some rewriting) is:

S: Rules
Rules: Rule | Rule Rules
Rule: id '=' Ids
Ids: id | Ids id

The language I am trying to generate looks something like this:

abc = def g hi jk lm
xy = aaa bbb ccc ddd eee fff jjj kkk
foo = bar ha ha 

Zero or more Rule that contain a left identifier followed by an equal sign followed by one or more identifers. The part that I think I will have a problem writing a parser for is that the grammar allows any amount of id in a Rule and that the only way to tell when a new Rule starts is when it finds id =, which would require backtracking.

Does anyone know the classification of this grammar and the best method of parsing, for a hand written parser?

like image 917
Austin Henley Avatar asked Oct 10 '22 18:10

Austin Henley


1 Answers

The grammar that generates an identifier followed by an equals sign followed by a finite sequence of identifiers is regular. This means that strings in the language can be parsed using a DFA or regular expression. No need for fancy nondeterministic or LL(*) parsers.

To see that the language is regular, let Id = U {a : a ∈ Γ}, where Γ ⊂ Σ is the set of symbols that can occur in identifiers. The language you are trying to generate is denoted by the regular expression

  • Id+ =( Id+)*Id+

Setting Γ = {a, b, ..., z}, examples of strings in the language of the regular expression are:

  • look = i am in a regular language
  • hey = that means i can be recognized by a dfa
  • cool = or even a regular expression

There is no need to parse your language using powerful parsing techniques. This is one case where parsing using regular expressions or DFA is both appropriate and optimal.

edit:

Call the above regular expression R. To parse R*, generate a DFA recognizing the language of R*. To do this, generate an NFA recognizing the language of R* using the algorithm obtainable from Kleene's theorem. Then convert the NFA into a DFA using the subset construction. The resultant DFA will recognize all strings in R*. Given a representation of the constructed DFA in your implementation language, the required actions - for instance,

  • Add the last identifier parsed to the right-hand side of the current declaration statement being parsed
  • Add the last declaration statement parsed to a list of parsed declarations, and use the last identifier parsed to begin parsing a new declaration statement

can be encoded into the states of the DFA. In reality, using Kleene's theorem and the subset construction is probably unnecessary for such a simple language. That is, you can probably just write a parser with the above two actions without implementing an automaton. Given a more complicated regular langauge (for instance, the lexical structure of a programming langauge), the conversion would be the best option.

like image 155
emi Avatar answered Oct 13 '22 11:10

emi