Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching balanced parenthesis in Ruby using recursive regular expressions like perl

Tags:

regex

ruby

perl

I have been looking for a way to match balanced parenthesis in a regex and found a way in Perl, that uses a recursive regular expression:

my $re;
$re = qr{
           \(
              (?:
                 (?> [^()]+ )       # Non-parens without backtracking
                 |
                 (??{ $re })        # Group with matching parens
              )*
           \)
         }x;

from the perl regular expression site .

Is there a way to do this in Ruby or a similar language?

UPDATE:

For those interested here are some interesting links:

Oniguruma manual - from Sawa's answer.

Pragmatic Programmers' Ruby 1.9 Regular Expressions Sample Chapter

like image 994
Yet Another Geek Avatar asked Jun 13 '11 13:06

Yet Another Geek


People also ask

How do you match parentheses in regex?

The way we solve this problem—i.e., the way we match a literal open parenthesis '(' or close parenthesis ')' using a regular expression—is to put backslash-open parenthesis '\(' or backslash-close parenthesis '\)' in the RE. This is another example of an escape sequence.

What does =~ mean in Ruby regex?

=~ is Ruby's basic pattern-matching operator. When one operand is a regular expression and the other is a string then the regular expression is used as a pattern to match against the string. (This operator is equivalently defined by Regexp and String so the order of String and Regexp do not matter.

What kind of regex does Ruby use?

A regular expression is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings. Ruby regular expressions i.e. Ruby regex for short, helps us to find particular patterns inside a string. Two uses of ruby regex are Validation and Parsing.

Are balanced parentheses regular?

There is no deterministic finite automaton that recognizes balanced parentheses. And therefore, balanced parentheses is not a regular language.


2 Answers

Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...) or (?'name'...). Then you call a subregex with \g<name> or \g'name' within the same regex. So your regex translated to oniguruma regex will be:

re = %r{
  (?<re>
    \(
      (?:
        (?> [^()]+ )
        |
        \g<re>
      )*
    \)
  )
}x

Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.

The manual for oniguruma is here.

like image 66
sawa Avatar answered Oct 22 '22 11:10

sawa


I like the above solution but frequently one wishes to ignore escaped characters. Assuming that \ escapes the following character the following regex handles escaped characters as well.

ESC= /(?<![\\])(?>[\\](?:[\\][\\])*)/
UNESC= /(?:\A|(?<=[^\\]))(?:[\\][\\])*/
BALANCED_PARENS = /#{UNESC}(
                   (?<bal>\(
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\g<bal>
                    )*
                    \))    ) /xm

Given the limitations of negative lookbehind the part delimited by matching parens will be the first capture not the whole match (the whole match may contain leading escaped backslashes).

The reason for the complexity of ESC and UNESC is the assumption that a \\ is an escaped backslash. We only use the UNESC sequence before the initial paren match since any other escaped parenthesis will be matched inside the atomic group and never backtracked. Indeed, if we tried to use the UNESC prefix for either an interior or final paren match it would fail when [^()] inside the atomic group matched the leading \'s and refused to backtrack.

This regex will scan for the first paren that delimits a validly balanced parenthetical. Thus, given the string " ( ( stuff )" it will match "( stuff )". Frequently, the desired behavior is to locate the first (unescaped) parenthesis and either match the interior (if balanced) or fail to match. Unfortunately, atomic grouping won't stop the entire regex from being backed out of and a match attempted at a later point so we must anchor at the start of the string and only look at the 1st capture. The following regex makes this change:

BALANCED_PARENS = /\A(?:#{ESC}\(|#{ESC}\)|[^()])*+
                  (?<match>\(
                   (?<bal>
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\(\g<bal>
                    )*
                    \))    ) /xm
like image 20
Peter Gerdes Avatar answered Oct 22 '22 13:10

Peter Gerdes