Can anyone outline for me an algorithm that can convert any given regex into an equivalent set of CFG rules?
I know how to tackle the elementary stuff such as (a|b)*:
S -> a A
S -> a B
S -> b A
S -> b B
A -> a A
A -> a B
A -> epsilon
B -> b A
B -> b B
B -> epsilon
S -> epsilon (end of string)
However, I'm having some problem formalizing it into a proper algorithm especially with more complex expressions that can have many nested operations.
use states as nonterminal symbols. use language as set of terminal symbols. add a transition p -> aq for any transition p -> q on letter a in the original automaton. use initial state as initial symbol in the grammar.
The Context-free grammar form NFA for the Regular Expression using the following construction rules: For each state there is a Non-Terminal symbol. If state A has a transition to state B on a symbol a. IF state A goes to state B, input symbol is e.
Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs.
Every regular grammar is context-free, but not all context-free grammars are regular. The following context-free grammar, for example, is also regular. This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side.
If you are just talking about regular expressions from a theoretical point of view, there are these three constructs:
ab # concatenation
a|b # alternation
a* # repetition or Kleene closure
What you could then just do:
S -> (fullRegex)
(x)*
in fullRegex
create a rule X -> x X
and X -> ε
, then replace (x)*
with X
.(a|b|c)
create rules Y -> a
, Y -> b
and Y -> c
, then replace (a|b|c)
with Y
Simply repeat this recursively (note that all x,
a
, b
and c
can still be complex regular expressions). Note that of course you have to use unique identifiers for every step.
This should be enough. This will certainly not give the most elegant or efficient grammar, but that is what normalization is for (and it should be done in a separate step and there are well-defined steps to do this).
One example: a(b|cd*(e|f)*)*
S -> a(b|cd*(e|f)*)*
S -> a X1; X1 -> (b|cd*(e|f)*) X1; X1 -> ε
S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> cd*(e|f)*
S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> c X2 (e|f)*; X2 -> d X2; X2 -> ε
... and a few more of those steps, until you end up with:
S -> a X1
X1 -> Y1 X1
X1 -> ε
Y1 -> b
Y1 -> c X2 X3
X2 -> d X2
X2 -> ε
X3 -> Y2 X3
X3 -> ε
Y2 -> e
Y2 -> f
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With