I would expect gsub
and stringr::str_replace_all
to return the same result in the following, but only gsub
returns the intended result. I am developing a lesson to demonstrate str_replace_all
so I would like to know why it returns a different result here.
txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"
gsub(".*2017|2018.*", "", txt)
stringr::str_replace_all(txt, ".*2017|2018.*", "")
gsub
returns the intended output (everything before and including 2017
, and after and including 2018
, has been removed).
output of gsub (intended)
[1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
However str_replace_all
only replaces the 2017
and 2018
but leaves the rest, even though the same pattern
is used for both.
output of str_replace_all (not intended)
[1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
Why is this the case?
Definitions of sub & gsub: The sub R function replaces the first match in a character string with new characters. The gsub R function replaces all matches in a character string with new characters. In the following tutorial, I’ll explain in two examples how to apply sub and gsub in R.
We can replace only the first occurrence of a particular character using sub () function, it will replace only the first occurrence character in the string Example: R program to replace character in a string using sub () function str_replace_all () is also a function that replaces the character with a particular character in a string.
The gsub function, in contrast, replaces all matches with “c” (i.e. all “a” of our example character string). In Example 1, we replaced only one character pattern (i.e. “a”). However, sometimes we might want to replace multiple patterns with the same new character.
str_replace_all () is also a function that replaces the character with a particular character in a string. It will replace all occurrences of the character. It is available in stringr package. So, we need to install and load the package Example: R program to replace character in a string using str_replace_all () function
Base R relies on two regex libraries.
As default R uses TRE.
We can specify perl = TRUE
to use PCRE (perl like regular expressions).
The {stringr} package uses ICU (Java like regular expressions).
In your case the problem is that the dot .
doesn’t match line breaks in PCRE and ICU, while it does match line breaks in TRE:
library(stringr)
txt <- ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n2017** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n2018** 0.70 0"
(base_tre <- gsub(".*2017|2018.*", "", txt))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(base_perl <- gsub(".*2017|2018.*", "", txt, perl = TRUE))
#> [1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(string_r <- str_replace_all(txt, ".*2017|2018.*", ""))
#> [1] ".72 2.51\n2015** 2.45 2.30 2.00 1.44 1.20 1.54 1.84 1.56 1.94 1.47 0.86 1.01\n2016** 1.53 1.75 2.40 2.62 2.35 2.03 1.25 0.52 0.45 0.56 1.88 1.17\n** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
identical(base_perl, string_r)
#> [1] TRUE
We can use modifiers
to change the behavior of PCRE and ICU regex so that line breaks are matched
by .
. This will produce the same output as with base R TRE:
(base_perl <- gsub("(?s).*2017|2018(?s).*", "", txt, perl = TRUE))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
(string_r <- str_replace_all(txt, "(?s).*2017|2018(?s).*", ""))
#> [1] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50\n"
identical(base_perl, string_r)
#> [1] TRUE
Finally, unlike TRE, PCRE and ICU allow us to use look arounds which are also an option to solve the problem
str_match(txt, "(?<=2017).*.(?=\\n2018)")
#> [,1]
#> [1,] "** 0.77 0.70 0.74 1.12 0.88 0.79 0.10 0.09 0.32 0.05 0.15 0.50"
Created on 2021-08-10 by the reprex package (v0.3.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With