I am really confused about how backreferences are used:
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")
gsub("(ab) 12", "\\1 34", strings)
[1] "^ab" "ab" "abc" "abd" "abe" "ab 34"
gsub("(ab)12", "\\2 34", strings)
[1] "^ab" "ab" "abc" "abd" "abe" "ab 12"
I know \1 refers to the first subpattern (reading from the left), \2 refers to the second subpattern, and so on. But I don't know what "subpattern" means. Why do \1 and \2 give different output?
gsub("(ab)", "\\1 34", strings)
[1] "^ab 34" "ab 34" "ab 34c" "ab 34d" "ab 34e" "ab 34 12"
Also, why does it give this result when I remove 12 after (ab)?
gsub("ab", "\\1 34", strings)
[1] "^ 34" " 34" " 34c" " 34d" " 34e" " 34 12"
Furthermore, what if ab has no parentheses? What does that indicate?
I am really mixed up about backreferences and hope someone can explain the logic clearly.
A backreference in a regular expression identifies a previously matched group and looks for exactly the same text again. A simple example of the use of backreferences is when you wish to look for adjacent, repeated words in some text. The first part of the match could use a pattern that extracts a single word.
backreference (noun, plural backreferences; regular expressions): an item in a regular expression equivalent to the text matched by an earlier pattern in the expression.
The backreference \1 (backslash one) references the first capturing group: \1 matches the exact same text that was matched by the first capturing group.
Normally, within a pattern, you create a back-reference to the content a capture group previously matched by using a backslash followed by the group number—for instance \1 for Group 1.
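The repeated-word case mentioned above can be sketched in R. This is only a minimal illustration and not part of the original question: the vector name `sentences` and its contents are made up, and perl = TRUE is assumed so that \\b and \\w behave as in Perl-style regular expressions.

```r
# Hypothetical sample data (not from the question)
sentences <- c("this is is a test", "no repeats here")

# (\w+) captures one word; \1 inside the pattern must match that same word again
pattern <- "\\b(\\w+)\\s+\\1\\b"

# Which elements contain an adjacent repeated word?
has_repeat <- grepl(pattern, sentences, perl = TRUE)
print(has_repeat)   # TRUE FALSE

# In the replacement, \\1 puts back a single copy of the captured word
deduped <- gsub(pattern, "\\1", sentences, perl = TRUE)
print(deduped)      # "this is a test" "no repeats here"
```

Here the same \\1 is used in two places: in the pattern it means "the same text again", and in the replacement it means "insert the captured text".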
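In the first and second cases there is a single capture group, i.e. a part of the pattern enclosed in (...). In the first case the replacement uses the backreference correctly (\\1, the first capture group); in the second case it uses \\2, which refers to a group that never existed. Note also that the second pattern, "(ab)12", has no space between (ab) and 12, so it does not match "ab 12" at all, which is why that output is unchanged. When ab has no parentheses there is no capture group at all, so \\1 is empty and the matched "ab" is simply replaced by " 34".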
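A small side-by-side sketch of these cases, reusing the `strings` vector from the question. The variable names `one_group`, `missing_group` and `no_group` are only for illustration, and the second call assumes the pattern was meant to include the space.

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# One capture group, pattern includes the space: \\1 is "ab"
one_group <- gsub("(ab) 12", "\\1 34", strings)
print(one_group)      # last element becomes "ab 34"

# Same pattern, but \\2 has no group behind it, so it contributes nothing
missing_group <- gsub("(ab) 12", "\\2 34", strings)
print(missing_group)  # last element becomes " 34"

# No parentheses at all: \\1 is empty, so each "ab" is replaced by " 34"
no_group <- gsub("ab", "\\1 34", strings)
print(no_group)       # "^ 34" " 34" " 34c" " 34d" " 34e" " 34 12"
```

The last call reproduces the output shown in the question, which is how the empty \\1 can be seen directly.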
To illustrate with two capture groups:
gsub("(ab)(d)", "\\1 34", strings)
#[1] "^ab" "ab" "abc" "ab 34" "abe" "ab 12"
Here we are using two capture groups, (ab) and (d). The replacement is the first backreference (\\1) followed by a space and 34. In 'strings' the pattern matches only the 4th element, "abd"; \\1 supplies "ab", which is followed by a space and 34, giving "ab 34".
Suppose we use the second backreference instead:
gsub("(ab)(d)", "\\2 34", strings)
#[1] "^ab" "ab" "abc" "d 34" "abe" "ab 12"
The text matched by the first group is dropped, and we get "d" followed by a space and 34.
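Both backreferences can also appear in the same replacement, in any order. A short sketch along the same lines (the names `swapped` and `both` are only for illustration):

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# Swap the two captured pieces: "abd" becomes "dab"
swapped <- gsub("(ab)(d)", "\\2\\1", strings)
print(swapped)  # "^ab" "ab" "abc" "dab" "abe" "ab 12"

# Keep both pieces and put literal text between and after them
both <- gsub("(ab)(d)", "\\1-\\2 34", strings)
print(both)     # "^ab" "ab" "abc" "ab-d 34" "abe" "ab 12"
```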
Suppose we use a general pattern instead of specific characters:
gsub("([a-z]+)\\s*(\\d+)", "\\1 34", strings)
#[1] "^ab" "ab" "abc" "abd" "abe" "ab 34"
gsub("([a-z]+)\\s*(\\d+)", "\\2 34", strings)
#[1] "^ab" "ab" "abc" "abd" "abe" "12 34"
Note how the value of the last element changes when switching from the first backreference to the second. The pattern is one or more lowercase letters in the first capture group (([a-z]+)), followed by zero or more whitespace characters (\\s*), followed by one or more digits in the second capture group ((\\d+)); it matches only the last element of 'strings'. In the replacement we use the first and second backreferences as shown above.
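To see what each capture group actually holds, one option is to pull the groups out with regexec() and regmatches() from base R. This is a sketch for inspection only; the names `groups` and `reordered` are made up here.

```r
pattern <- "([a-z]+)\\s*(\\d+)"
input <- "ab 12"

# regexec() records the whole match and each capture group;
# regmatches() extracts them as: full match, group 1, group 2
groups <- regmatches(input, regexec(pattern, input))[[1]]
print(groups)     # "ab 12" "ab" "12"

# Using both backreferences in the replacement, in reverse order
reordered <- sub(pattern, "\\2 \\1", input)
print(reordered)  # "12 ab"
```

The second and third elements of `groups` are exactly what \\1 and \\2 insert in the replacement.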