
How to embed regular expressions in other regular expressions in Ruby

Tags: regex, ruby

I have a string:

'A Foo'

and want to find "Foo" in it.

I have a regular expression:

/foo/

that I'm embedding into another case-insensitive regular expression, so I can build the pattern in steps:

foo_regex = /foo/
pattern = /A #{ foo_regex }/i

But it won't match correctly:

'A Foo' =~ pattern # => nil

If I embed the text directly into the pattern it works:

'A Foo' =~ /A foo/i # => 0

What's wrong?

asked Mar 27 '17 22:03 by the Tin Man



1 Answer

On the surface it seems that embedding a pattern inside another pattern would simply work, but that rests on a bad assumption about how patterns work in Ruby: that they're simply strings. Using:

foo_regex = /foo/

creates a Regexp object:

/foo/.class # => Regexp

As such it has knowledge of the optional flags used to create it:

( /foo/    ).options # => 0
( /foo/i   ).options # => 1
( /foo/x   ).options # => 2
( /foo/ix  ).options # => 3
( /foo/m   ).options # => 4
( /foo/im  ).options # => 5
( /foo/mx  ).options # => 6
( /foo/imx ).options # => 7

or, if you like binary:

'%04b' % ( /foo/    ).options # => "0000"
'%04b' % ( /foo/i   ).options # => "0001"
'%04b' % ( /foo/x   ).options # => "0010"
'%04b' % ( /foo/xi  ).options # => "0011"
'%04b' % ( /foo/m   ).options # => "0100"
'%04b' % ( /foo/mi  ).options # => "0101"
'%04b' % ( /foo/mx  ).options # => "0110"
'%04b' % ( /foo/mxi ).options # => "0111"

and it remembers those flags whenever the Regexp is used, whether as a standalone pattern or embedded in another.
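The numbers above are bit flags, so they can be decoded with the constants that Regexp defines. A minimal sketch, where flag_letters is a hypothetical helper name that is not part of Ruby:

```ruby
# Hypothetical helper: list the flag letters set on a pattern by testing
# each bit of Regexp#options against the constants defined on Regexp.
def flag_letters(regex)
  bits = {
    i: Regexp::IGNORECASE, # 1
    x: Regexp::EXTENDED,   # 2
    m: Regexp::MULTILINE   # 4
  }
  bits.select { |_letter, bit| (regex.options & bit) != 0 }.keys
end

flag_letters(/foo/)    # => []
flag_letters(/foo/im)  # => [:i, :m]
flag_letters(/foo/imx) # => [:i, :x, :m]
```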

You can see this in action if we look to see what the pattern looks like after embedding:

/#{ /foo/  }/ # => /(?-mix:foo)/
/#{ /foo/i }/ # => /(?i-mx:foo)/

?-mix: and ?i-mx: are how those options are represented in an embedded pattern: flags before the hyphen are switched on, flags after it are switched off.
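Interpolation calls to_s on the embedded object, and Regexp#to_s is what produces the option-carrying form. A short sketch comparing it with source and inspect (the variable names are only for illustration):

```ruby
foo_regex = /foo/

as_string  = foo_regex.to_s    # => "(?-mix:foo)"  what interpolation inserts
as_source  = foo_regex.source  # => "foo"          pattern text only, no options
as_inspect = foo_regex.inspect # => "/foo/"        the literal form

# String interpolation and to_s give the same text.
"#{foo_regex}" == as_string    # => true
```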

According to the Regexp documentation for Options:

i, m, and x can also be applied on the subexpression level with the (?on-off) construct, which enables options on, and disables options off for the expression enclosed by the parentheses.
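The same construct can be written by hand. As a small illustration, the group below is case-insensitive while the rest of the pattern is not:

```ruby
# Only the group is case-insensitive; the leading "A " stays case-sensitive.
scoped = /A (?i:foo)/

upper_a = 'A Foo' =~ scoped # => 0
lower_a = 'a Foo' =~ scoped # => nil
```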

So the embedded Regexp keeps its own options even inside the outer pattern. Here the inner (?-mix:...) group switches case-insensitivity off for foo, overriding the outer i flag, so the overall pattern fails to match:

pattern = /A #{ foo_regex }/i # => /A (?-mix:foo)/i
'A Foo' =~ pattern # => nil

It's possible to make sure every sub-expression is created with the same options as its surrounding pattern, but that quickly becomes convoluted and messy:

foo_regex = /foo/i
pattern = /A #{ foo_regex }/i # => /A (?i-mx:foo)/i
'A Foo' =~ pattern # => 0

Instead, use the source method, which returns the text of a pattern without its options:

/#{ /foo/.source  }/ # => /foo/
/#{ /foo/i.source }/ # => /foo/

The problem of the embedded pattern remembering its options also appears with Regexp methods that return a Regexp, such as union:

/#{ Regexp.union(%w[a b]) }/ # => /(?-mix:a|b)/

and again, source can help:

/#{ Regexp.union(%w[a b]).source }/ # => /a|b/
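One caveat when using source with alternations: the wrapping group disappears along with the options, so the alternation can spread across the rest of the outer pattern. A sketch of the difference, using a non-capturing group to restore the grouping (the x and y around it are arbitrary filler):

```ruby
union = Regexp.union(%w[a b])

ungrouped = /x#{ union.source }y/     # => /xa|by/      "xa" OR "by"
grouped   = /x(?:#{ union.source })y/ # => /x(?:a|b)y/  "x", then a or b, then "y"

loose_match  = 'xa'  =~ ungrouped # => 0   matches the "xa" alternative alone
strict_match = 'xay' =~ grouped   # => 0
strict_miss  = 'xa'  =~ grouped   # => nil
```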

Knowing all that:

foo_regex = /foo/
pattern = /#{ foo_regex.source }/i # => /foo/i
'A Foo' =~ pattern # => 2
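The same approach works with the "A " prefix from the question left in place. As a sketch, the pattern can also be built from a string with Regexp.new and a flag constant; the variable names here are only for illustration:

```ruby
foo_regex = /foo/

# Interpolate the source so the outer i flag applies to the whole pattern.
with_prefix = /A #{ foo_regex.source }/i # => /A foo/i
literal_hit = 'A Foo' =~ with_prefix     # => 0

# Alternative construction from a string plus an options constant.
built     = Regexp.new("A #{ foo_regex.source }", Regexp::IGNORECASE)
built_hit = 'A Foo' =~ built             # => 0
```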
answered Oct 06 '22 09:10, 2 revs