I have this regular expression:
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
When I test it against several strings, it appears to be as powerful as a context free grammar because it handles the recursion properly.
regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
regex.match("aaacaa")
# => nil
"Fun with Ruby 1.9 Regular Expressions" has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:
sentence = %r{
(?<subject> cat | dog | gerbil ){0}
(?<verb> eats | drinks| generates ){0}
(?<object> water | bones | PDFs ){0}
(?<adjective> big | small | smelly ){0}
(?<opt_adj> (\g<adjective>\s)? ){0}
The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x
Between his technique for rearranging the parts of the regex, and my example of recursive named capturing groups, does this mean Ruby 1.9 regular expressions have the power equivalent to a context-free grammar?
Context-free grammars are strictly more powerful than regular expressions. Any language that can be generated using regular expressions can be generated by a context-free grammar. There are languages that can be generated by a context-free grammar that cannot be generated by any regular expression.
Two uses of ruby regex are Validation and Parsing. Ruby regex can be used to validate an email address and an IP address too. Ruby regex expressions are declared between two forward slashes. This will return the index of first occurrence of the word 'hi' if present, or else will return ' nil '.
EDIT: as pointed out in the comments the regex engine used by ruby is Onigmo which according to the linked to page has some of the new features of Perl 5.10.
This is one of the awesome things about the Oniguruma regexp engine used in Ruby 1.9 – it has the power of a parser, and is not restricted to recognizing regular languages. It has positive and negative lookahead/lookbehind, which even can be used to recognize some languages which are not context-free! Take the following as an example:
regexp = /\A(?<AB>a\g<AB>b|){0}(?=\g<AB>c)a*(?<BC>b\g<BC>c|){1}\Z/
This regexp recognizes strings like “abc”, “aabbcc”, “aaabbbccc”, and so on – the number of “a”, “b”, and “c” must be equal, or it will not match.
(One limitation: you can’t use named groups in the lookahead and lookbehind.)
Although I haven’t peeked under the hood, Oniguruma seems to deal with named groups by simple recursive descent, backing up when something doesn’t match. I’ve observed that it can’t deal with left recursion. For example:
irb(main):013:0> regexp = /(?<A>\g<A>a|)/
SyntaxError: (irb):13: never ending recursion: /(?<A>\g<A>a|)/
from C:/Ruby192/bin/irb:12:in `<main>'
I don’t remember my parsing theory very clearly, but I think that a non-deterministic top-down parser like this should be able to parse any context-free language. (“language”, not “grammar”; if your grammar has left recursion, you will have to convert it to right recursion.) If that is incorrect, please edit this post.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With