Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting a Backus–Naur form grammar to a .Net regex

Tags:

.net

regex

bnf

Is there any way to convert the following Backus–Naur form (BNF) grammar into a .Net regex? (I'm not stuck on the BNF, but I thought it might be the best way to explain what I was trying to do).

<field> ::= "<<" <fieldname> <options> ">>"

<options> ::= "" | "(" <option> ")"

<option> ::= "" | 
             <option> <non-paren> | 
             <option> <escaped-character>

<escaped-character> ::= "\\" | "\)"

<non-paren> ::= any character but paren

<fieldname> ::= any string that doesn't contain "(" or ">>"

I'm close, but I can't figure out how to deal with escaping \ and ). This captures the fieldname and option in named groups:

<<(?<fieldname>.\*?)(\((?<option>.*?)\))?>>

Edit

It turns out that I was rustier at BNF grammars than I thought.

What I was trying to get at is that parenthesis are special characters. Inside the "option" section, they must be escaped by a slash. (And slashes must also be escaped).

like image 918
Adam Tegen Avatar asked Jan 20 '09 23:01

Adam Tegen


People also ask

How do you do grammar in regex?

Consider the regular expression (a + b)*a. We will now construct a regular grammar for this regular expression. For every terminal symbol a, we create a regular grammar with the rule S \arrow a, start symbol S. We then apply the transformations to these regular grammars, progressively constructing the regular grammar.

How is Backus-Naur form used to describe a formal language?

Backus-Naur notation (shortly BNF) is a formal mathematical way to describe a language, (to describe the syntax of the programming languages). where the LHS is a non-terminal symbol and the RHS is a sequence of symbols (terminals or non-terminals).

What is the difference between BNF and CFG?

BNF and CFG (Context Free Grammar) were nearly identical. BNF may be a meta-language (a language that cannot describe another language) for primary languages. The symbol ::= means “may expand into” and “may get replaced with.” In some texts, a reputation is additionally called a non-terminal symbol.

What is the difference between the BNF and EBNF?

BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon, marks the end of a rule. Furthermore, EBNF includes mechanisms for enhancements, defining the number of repetitions, excluding alternatives, comments, etc.


2 Answers

BNF is used to describe context-free languages, which regex can't normally describe. What separates context-free languages and regex is that context-free langauges can have recursion on both sides at the same time. A classic example is the balanced parenthesis problem.

paren = paren paren
      | '(' paren ')'  <-- there are characters on both sides of the recursion
      | ''

In your case, you don't use any double-sided recursion, so it reduces to a regular language.

fieldname = /(?:>?[^(>])+/    //No double >, but single ones are ok.
option = /(?:[^()\\]|\\.)*/   //No parens, unless preceeded by \

pattern = /<<(?<fieldname>   )(?:\((?<option>   )\))?>>/

Putting it together:

pattern = /<<(?<fieldname>(?:>?[^(>])+)(?:\((?<option>(?:[^()\\]|\\.)*)\))?>>/

Some border cases:

<<f>oo(bar>>)>> --> ('f>oo', 'bar>>')
<<foo(bar\))>>  --> ('foo', 'bar\)')
<<foo(bar\\)>>  --> ('foo', 'bar\\')
<<foo\(bar)>>   --> ('foo\', 'bar')

EDIT:

If you want any extra parenthesis characters (and back-slashes) to have to be escaped inside << and >>, you could do something like this:

fieldname = /(?:<?[^()\\<]|<?\\[()\\])+/
options = /(?:[^()\\]|\\[()\\])*/
pattern = /<<(?<fieldname>   )(?:\((?<option>   )\))?>>/

/<<(?<fieldname>(?:<?[^()\\]|<?\\[()\\])+)(?:\((?<option>(?:[^()\\]|\\[()\\])*)\))?>>/

updated:

<<f>oo(bar>>)>> --> ('f>oo', 'bar>>')
<<foo(bar\))>>  --> ('foo', 'bar\)')
<<foo(bar\\)>>  --> ('foo', 'bar\\')
<<foo\(bar)>>   --> doesn't match
<<foo\((bar)>>  --> ('foo\(', 'bar')
like image 107
Markus Jarderot Avatar answered Sep 30 '22 10:09

Markus Jarderot


Regular expressions denote regular languages. Context-free grammars generate context-free languages. The former language set is a subset of the latter and in the general case you cannot express a context-free language as a regular expression.

like image 24
user44511 Avatar answered Sep 30 '22 09:09

user44511