Is there any way to convert the following Backus–Naur form (BNF) grammar into a .Net regex? (I'm not stuck on the BNF, but I thought it might be the best way to explain what I was trying to do).
<field> ::= "<<" <fieldname> <options> ">>"
<options> ::= "" | "(" <option> ")"
<option> ::= "" |
<option> <non-paren> |
<option> <escaped-character>
<escaped-character> ::= "\\" | "\)"
<non-paren> ::= any character but paren
<fieldname> ::= any string that doesn't contain "(" or ">>"
I'm close, but I can't figure out how to deal with escaping \
and )
. This captures the fieldname
and option
in named groups:
<<(?<fieldname>.\*?)(\((?<option>.*?)\))?>>
Edit
It turns out that I was rustier at BNF grammars than I thought.
What I was trying to get at is that parenthesis are special characters. Inside the "option" section, they must be escaped by a slash. (And slashes must also be escaped).
Consider the regular expression (a + b)*a. We will now construct a regular grammar for this regular expression. For every terminal symbol a, we create a regular grammar with the rule S \arrow a, start symbol S. We then apply the transformations to these regular grammars, progressively constructing the regular grammar.
Backus-Naur notation (shortly BNF) is a formal mathematical way to describe a language, (to describe the syntax of the programming languages). where the LHS is a non-terminal symbol and the RHS is a sequence of symbols (terminals or non-terminals).
BNF and CFG (Context Free Grammar) were nearly identical. BNF may be a meta-language (a language that cannot describe another language) for primary languages. The symbol ::= means “may expand into” and “may get replaced with.” In some texts, a reputation is additionally called a non-terminal symbol.
BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon, marks the end of a rule. Furthermore, EBNF includes mechanisms for enhancements, defining the number of repetitions, excluding alternatives, comments, etc.
BNF is used to describe context-free languages, which regex can't normally describe. What separates context-free languages and regex is that context-free langauges can have recursion on both sides at the same time. A classic example is the balanced parenthesis problem.
paren = paren paren
| '(' paren ')' <-- there are characters on both sides of the recursion
| ''
In your case, you don't use any double-sided recursion, so it reduces to a regular language.
fieldname = /(?:>?[^(>])+/ //No double >, but single ones are ok.
option = /(?:[^()\\]|\\.)*/ //No parens, unless preceeded by \
pattern = /<<(?<fieldname> )(?:\((?<option> )\))?>>/
Putting it together:
pattern = /<<(?<fieldname>(?:>?[^(>])+)(?:\((?<option>(?:[^()\\]|\\.)*)\))?>>/
Some border cases:
<<f>oo(bar>>)>> --> ('f>oo', 'bar>>')
<<foo(bar\))>> --> ('foo', 'bar\)')
<<foo(bar\\)>> --> ('foo', 'bar\\')
<<foo\(bar)>> --> ('foo\', 'bar')
EDIT:
If you want any extra parenthesis characters (and back-slashes) to have to be escaped inside <<
and >>
, you could do something like this:
fieldname = /(?:<?[^()\\<]|<?\\[()\\])+/
options = /(?:[^()\\]|\\[()\\])*/
pattern = /<<(?<fieldname> )(?:\((?<option> )\))?>>/
/<<(?<fieldname>(?:<?[^()\\]|<?\\[()\\])+)(?:\((?<option>(?:[^()\\]|\\[()\\])*)\))?>>/
updated:
<<f>oo(bar>>)>> --> ('f>oo', 'bar>>')
<<foo(bar\))>> --> ('foo', 'bar\)')
<<foo(bar\\)>> --> ('foo', 'bar\\')
<<foo\(bar)>> --> doesn't match
<<foo\((bar)>> --> ('foo\(', 'bar')
Regular expressions denote regular languages. Context-free grammars generate context-free languages. The former language set is a subset of the latter and in the general case you cannot express a context-free language as a regular expression.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With