Backreference in R

Tags:

regex

r

gsub

backreference

I am confused about how backreferences work:

strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

gsub("(ab) 12", "\\1 34", strings)
[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 34"

gsub("(ab)12", "\\2 34", strings)
[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 12"

I know \1 refers to the first subpattern (reading from the left), \2 refers to the second subpattern, and so on. But I don't know what "subpattern" means. Why do \1 and \2 give different output?

gsub("(ab)", "\\1 34", strings)
[1] "^ab 34"   "ab 34"    "ab 34c"   "ab 34d"   "ab 34e"   "ab 34 12"

Also, why does removing the 12 after (ab) give this result?

gsub("ab", "\\1 34", strings)
[1] "^ 34"   " 34"    " 34c"   " 34d"   " 34e"   " 34 12"

Furthermore, what happens when ab is not in parentheses? What does that indicate?

I am really mixed up about backreferences and hope someone can explain the logic clearly.

asked Jul 31 '16 07:07

Bratt Swan

1 Answer

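In both the first and the second call there is a single capture group, i.e. a part of the pattern enclosed in (...). In the first call the replacement uses the backreference correctly: \\1 refers to that first (and only) capture group. In the second call the replacement uses \\2, which refers to a second capture group that does not exist. Note also that the second pattern, (ab)12, has no space between ab and 12, so it does not match "ab 12" (or any other element) and nothing is replaced.

To illustrate with a pattern that has two capture groups: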
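Before looking at the replacement, it can help to check whether each pattern matches anything at all. A minimal sketch with base R's grepl(), reusing the 'strings' vector from the question (redefined here so the snippet is self-contained; the names m_space and m_nospace are only illustrative):

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# With the space, the pattern matches the last element only
m_space <- grepl("(ab) 12", strings)

# Without the space, "(ab)12" would need the text "ab12",
# which does not occur in any element
m_nospace <- grepl("(ab)12", strings)

print(m_space)
print(m_nospace)
```

When a pattern matches nothing, gsub() returns its input unchanged, whatever the replacement says.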

gsub("(ab)(d)", "\\1 34", strings)
#[1] "^ab"   "ab"    "abc"   "ab 34" "abe"   "ab 12"

Here we use two capture groups, (ab) and (d). The replacement is the first backreference (\\1) followed by a space and 34. In 'strings' the pattern matches only the 4th element, "abd"; \\1 holds "ab", so the match is replaced by "ab 34".

Now suppose we use the second backreference instead:
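To see what each capture group actually holds, one option is to extract the groups instead of replacing them. A small sketch using base R's regexec() and regmatches(), again on the 'strings' vector from the question (the names m and parts are only illustrative):

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# regexec() records the position of the whole match and of each
# parenthesised group; regmatches() turns those positions into text
m <- regexec("(ab)(d)", strings)
parts <- regmatches(strings, m)

# For the 4th element: whole match first, then group 1, then group 2
print(parts[[4]])

# Elements that do not match yield an empty character vector
print(parts[[1]])
```

The second and third entries of parts[[4]] are the texts that \\1 and \\2 stand for in a replacement.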

gsub("(ab)(d)", "\\2 34", strings)
#[1] "^ab"   "ab"    "abc"   "d 34"  "abe"   "ab 12"

The text of the first group is dropped, and we get "d" followed by a space and 34.

Now suppose we use a more general pattern instead of specific characters:
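Both backreferences can also appear in the same replacement, in any order. A short sketch on the same 'strings' vector (the names swapped and both are only illustrative):

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# Swap the two captured pieces: "abd" becomes "dab"
swapped <- gsub("(ab)(d)", "\\2\\1", strings)

# Keep both pieces, separated by a hyphen, then append " 34"
both <- gsub("(ab)(d)", "\\1-\\2 34", strings)

print(swapped)
print(both)
```

Only the 4th element changes in either call, because it is the only one the pattern matches.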

gsub("([a-z]+)\\s*(\\d+)", "\\1 34", strings)
#[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 34"
gsub("([a-z]+)\\s*(\\d+)", "\\2 34", strings)
#[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "12 34"

Note how the last element changes when we switch from the first backreference to the second. The pattern is one or more lowercase letters in the first capture group, ([a-z]+), followed by zero or more whitespace characters (\\s*), followed by one or more digits in the second capture group, (\\d+). It matches only the last element of 'strings'. In the replacement we use the first and then the second backreference, as shown above.
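The question's last two calls, with and without parentheses around ab, can be read the same way. Judging by the output shown in the question (default regex engine), a backreference to a group that does not exist is replaced by an empty string. A sketch contrasting the two calls, plus the general pattern with both groups reordered (the names no_group, with_group and reordered are only illustrative):

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# No parentheses: there is no group 1, so \\1 contributes nothing
# and every "ab" is replaced by just " 34"
no_group <- gsub("ab", "\\1 34", strings)

# With parentheses: \\1 holds "ab", so "ab" is kept and " 34" is appended
with_group <- gsub("(ab)", "\\1 34", strings)

# General pattern: put the digits first and the letters second
reordered <- gsub("([a-z]+)\\s*(\\d+)", "\\2 \\1", strings)

print(no_group)
print(with_group)
print(reordered)
```

The first two results reproduce the outputs in the question; the third changes only the last element, to "12 ab".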

answered Sep 19 '22 01:09

akrun
