Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Backslash + captured group within Ruby regular expression

Tags:

regex

ruby

How do I excape a backslash before a captured group?

Example:

"foo+bar".gsub(/(\+)/, '\\\1')

What I expect (and want):

foo\+bar

what I unfortunately get:

foo\\1bar

How do I escape here correctly?

like image 292
Zardoz Avatar asked Apr 22 '11 12:04

Zardoz


People also ask

How do you backslash in regular expressions?

To insert a backslash into your regular expression pattern, use a double backslash ('\\').

What does backslash number mean in regex?

The backreference \1 (backslash one) references the first capturing group. \1 matches the exact same text that was matched by the first capturing group. The / before it is a literal character. It is simply the forward slash in the closing HTML tag that we are trying to match.

Why backslash is used in regex?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "."

What does =~ mean in Ruby regex?

=~ is Ruby's basic pattern-matching operator. When one operand is a regular expression and the other is a string then the regular expression is used as a pattern to match against the string. (This operator is equivalently defined by Regexp and String so the order of String and Regexp do not matter.


1 Answers

As others have said, you need to escape everything in that string twice. So in your case the solution is to use '\\\\\1' or '\\\\\\1'. But since you asked why, I'll try to explain that part.

The reason is that replacement sequence is being parsed twice--once by Ruby and once by the underlying regular expression engine, for whom \1 is its own escape sequence. (It's probably easier to understand with double-quoted strings, since single quotes introduce an ambiguity where '\\1' and '\1' are equivalent but '\' and '\\' are not.)

So for example, a simple replacement here with a captured group and a double quoted string would be:

"foo+bar".gsub(/(\+)/, "\\1")   #=> "foo+bar"

This passes the string \1 to the regexp engine, which it understands as a reference to a capture group. In Ruby string literals, "\1" means something else entirely (ASCII character 1).

What we actually want in this case is for the regexp engine to receive \\\1. It also understands \ as an escape character, so \\1 is not sufficient and will simply evaluate to the literal output \1. So, we need \\\1 in the regexp engine, but to get to that point we need to also make it past Ruby's string literal parser.

To do that, we take our desired regexp input and double every backslash again to get through Ruby's string literal parser. \\\1 therefore requires "\\\\\\1". In the case of single quotes one slash can be omitted as \1 is not a valid escape sequence in single quotes and is treated literally.

Addendum

One of the reasons this problem is usually hidden is thanks to the use of /.+/ style regexp quotes, which Ruby treats in a special way to avoid the need to double escape everything. (Of course, this doesn't apply to gsub replacement strings.) But you can still see it in action if you use a string literal instead of a regexp literal in Regexp.new:

Regexp.new("\.").match("a")   #=> #<MatchData "a">
Regexp.new("\\.").match("a")  #=> nil

As you can see, we had to double-escape the . for it to be understood as a literal . by the regexp engine, since "." and "\." both evaluate to . in double-quoted strings, but we need the engine itself to receive \..

like image 157
evnkm Avatar answered Sep 19 '22 01:09

evnkm