Algorithm to generate context free grammar from any regex

Tags: regex, algorithm, nlp, context-free-grammar, computation-theory

Can anyone outline for me an algorithm that can convert any given regex into an equivalent set of CFG rules?

I know how to tackle the elementary stuff such as (a|b)*:

S -> a A
S -> a B
S -> b A
S -> b B
A -> a A
A -> a B
A -> epsilon
B -> b A
B -> b B
B -> epsilon
S -> epsilon (end of string)

However, I'm having trouble formalizing this into a proper algorithm, especially for more complex expressions that can contain many nested operations.

asked Oct 30 '12 13:10

gamerx

1 Answer

If you are just talking about regular expressions from a theoretical point of view, there are these three constructs:

ab       # concatenation
a|b      # alternation
a*       # repetition or Kleene closure

What you could then just do:

- create a rule S -> (fullRegex)
- for every repeated term (x)* in fullRegex create a rule X -> x X and X -> ε, then replace (x)* with X
- for every alternation (a|b|c) create rules Y -> a, Y -> b and Y -> c, then replace (a|b|c) with Y

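Simply repeat this recursively until no * or | is left on any right-hand side (note that x, a, b and c can themselves still be complex regular expressions). Concatenation needs no rule of its own, because the symbols on the right-hand side of a rule are already concatenated. Of course, you have to introduce a fresh, unique nonterminal name at every step.

This is enough to cover the three constructs above. It will certainly not give the most elegant or efficient grammar, but that is what normalization is for; it should be done as a separate step, and there are well-defined procedures for it.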
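The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the function names `regex_to_cfg` and `format_rules` are made up for this example, literals are assumed to be single characters other than the metacharacters, and parsing is done by a small recursive-descent parser. Because the parser reaches the innermost operators first, the X and Y numbering it produces will differ from a hand derivation that works from the outside in, although the rules should be the same up to renaming.

```python
from itertools import count


def regex_to_cfg(regex):
    """Return a list of (lhs, rhs_symbols) rules for a regex that uses only
    single-character literals, concatenation, '|', '*' and parentheses."""
    rules = []
    x_ids, y_ids = count(1), count(1)  # fresh names: X1, X2, ... and Y1, Y2, ...
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def alternation():
        # alternation := concatenation ('|' concatenation)*
        nonlocal pos
        branches = [concatenation()]
        while peek() == "|":
            pos += 1
            branches.append(concatenation())
        if len(branches) == 1:
            return branches[0]
        name = f"Y{next(y_ids)}"
        for branch in branches:  # Y -> a, Y -> b, Y -> c
            rules.append((name, branch))
        return [name]

    def concatenation():
        # Concatenation is just a sequence of symbols on a right-hand side.
        symbols = []
        while peek() is not None and peek() not in "|)":
            symbols.extend(starred())
        return symbols

    def starred():
        # starred := (literal | '(' alternation ')') '*'*
        nonlocal pos
        ch = regex[pos]
        pos += 1
        if ch == "*":
            raise ValueError("'*' has nothing to repeat")
        if ch == "(":
            body = alternation()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
        else:
            body = [ch]
        while peek() == "*":
            pos += 1
            name = f"X{next(x_ids)}"
            rules.append((name, body + [name]))  # X -> x X
            rules.append((name, []))             # X -> ε
            body = [name]
        return body

    start = alternation()
    if pos != len(regex):
        raise ValueError("unbalanced ')'")
    rules.insert(0, ("S", start))  # S -> (fullRegex), already rewritten
    return rules


def format_rules(rules):
    """Render rules as 'A -> x y' strings, writing ε for an empty right-hand side."""
    return [f"{lhs} -> {' '.join(rhs) or 'ε'}" for lhs, rhs in rules]


if __name__ == "__main__":
    for line in format_rules(regex_to_cfg("a(b|cd*(e|f)*)*")):
        print(line)
```

The sketch rewrites each operator as soon as it has been parsed, so no intermediate rule ever contains a raw regex fragment; that is a design choice for brevity, not something the procedure requires.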

One example: a(b|cd*(e|f)*)*

S -> a(b|cd*(e|f)*)*

S -> a X1; X1 -> (b|cd*(e|f)*) X1; X1 -> ε

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> cd*(e|f)*

S -> a X1; X1 -> Y1 X1; X1 -> ε; Y1 -> b; Y1 -> c X2 (e|f)*; X2 -> d X2; X2 -> ε

... and a few more of those steps, until you end up with:

S  -> a X1
X1 -> Y1 X1
X1 -> ε
Y1 -> b
Y1 -> c X2 X3
X2 -> d X2
X2 -> ε
X3 -> Y2 X3
X3 -> ε
Y2 -> e
Y2 -> f
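One way to gain some confidence in a grammar like this is to compare it with the regex on all short strings. The following is a rough sketch under stated assumptions: the grammar above is transcribed by hand into a dictionary, the helper names `language_up_to` and `regex_language_up_to` are invented for this example, and the bounded search relies on the fact that in this particular grammar every recursive expansion eventually contributes at least one terminal. Agreement up to a length bound is a sanity check, not a proof of equivalence.

```python
import re
from itertools import product

# The final grammar from above; an empty list stands for ε.
GRAMMAR = {
    "S": [["a", "X1"]],
    "X1": [["Y1", "X1"], []],
    "Y1": [["b"], ["c", "X2", "X3"]],
    "X2": [["d", "X2"], []],
    "X3": [["Y2", "X3"], []],
    "Y2": [["e"], ["f"]],
}


def language_up_to(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    found, seen, stack = set(), set(), [(start,)]
    while stack:
        form = stack.pop()
        if form in seen:
            continue
        seen.add(form)
        # Terminals never disappear, so too many of them means a dead end.
        if sum(sym not in grammar for sym in form) > max_len:
            continue
        # Always expand the leftmost nonterminal.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            found.add("".join(form))
            continue
        for rhs in grammar[form[idx]]:
            stack.append(form[:idx] + tuple(rhs) + form[idx + 1:])
    return found


def regex_language_up_to(pattern, alphabet, max_len):
    """All strings over alphabet of length <= max_len that the regex fully matches."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }


if __name__ == "__main__":
    from_grammar = language_up_to(GRAMMAR, "S", 5)
    from_regex = regex_language_up_to(r"a(b|cd*(e|f)*)*", "abcdef", 5)
    print(from_grammar == from_regex)
```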

answered Sep 21 '22 06:09

Martin Ender
