Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to generate context free grammar from any regex

Can anyone outline for me an algorithm that can convert any given regex into an equivalent set of CFG rules?

I know how to tackle the elementary stuff such as (a|b)*:

S -> a A
S -> a B
S -> b A
S -> b B
A -> a A
A -> a B
A -> epsilon
B -> b A
B -> b B
B -> epsilon
S -> epsilon (end of string)

However, I'm having some problem formalizing it into a proper algorithm especially with more complex expressions that can have many nested operations.

like image 777
gamerx Avatar asked Oct 30 '12 13:10

gamerx


People also ask

How do you create a CFG in RegEx?

use states as nonterminal symbols. use language as set of terminal symbols. add a transition p -> aq for any transition p -> q on letter a in the original automaton. use initial state as initial symbol in the grammar.

How do you write a context-free grammar for a regular expression?

The Context-free grammar form NFA for the Regular Expression using the following construction rules: For each state there is a Non-Terminal symbol. If state A has a transition to state B on a symbol a. IF state A goes to state B, input symbol is e.

What algorithm does RegEx use?

Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs.

Can every regular expression can be expressed using a context-free grammar?

Every regular grammar is context-free, but not all context-free grammars are regular. The following context-free grammar, for example, is also regular. This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side.


1 Answers

If you are just talking about regular expressions from a theoretical point of view, there are these three constructs:

ab       # concatenation
a|b      # alternation
a*       # repetition or Kleene closure

What you could then just do:

  • create a rule S -> (fullRegex)
  • for every repeated term (x)* in fullRegex create a rule X -> x X and X -> ε, then replace (x)* with X.
  • for every alternation (a|b|c) create rules Y -> a, Y -> b and Y -> c, then replace (a|b|c) with Y

Simply repeat this recursively (note that all x, a, b and c can still be complex regular expressions). Note that of course you have to use unique identifiers for every step.

This should be enough. This will certainly not give the most elegant or efficient grammar, but that is what normalization is for (and it should be done in a separate step and there are well-defined steps to do this).

One example: a(b|cd*(e|f)*)*

S -> a(b|cd*(e|f)*)*

S -> a X1; X1 -> (b|cd*(e|f)*) X1; X1 -> ε

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> cd*(e|f)*

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> c X2 (e|f)*; X2 -> d X2; X2 -> ε

... and a few more of those steps, until you end up with:

S  -> a X1
X1 -> Y1 X1
X1 -> ε
Y1 -> b
Y1 -> c X2 X3
X2 -> d X2
X2 -> ε
X3 -> Y2 X3
X3 -> ε
Y2 -> e
Y2 -> f
like image 155
Martin Ender Avatar answered Sep 21 '22 06:09

Martin Ender