
Different behavior of base R gsub and stringr::str_replace_all?

I would expect gsub and stringr::str_replace_all to return the same result in the following, but only gsub returns the intended result. I am developing a lesson to demonstrate str_replace_all so I would like to know why it returns a different result here.

txt <- ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n2017**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n2018**   0.70   0"

gsub(".*2017|2018.*", "", txt)

stringr::str_replace_all(txt, ".*2017|2018.*", "")

gsub returns the intended output (everything before and including 2017, and after and including 2018, has been removed).

output of gsub (intended)

[1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

However, str_replace_all removes only 2017 itself and the final line starting at 2018, and leaves the preceding lines in place, even though the same pattern is used for both.

output of str_replace_all (not intended)

[1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

Why is this the case?

qdread asked Jun 19 '20 13:06



1 Answer

Base R relies on two regex libraries. By default it uses TRE; specifying perl = TRUE switches to PCRE (Perl-compatible regular expressions). The {stringr} package uses ICU (Java-like regular expressions).

In your case, the problem is that the dot . does not match line breaks in PCRE and ICU by default, whereas it does in TRE. With PCRE and ICU, .*2017 can therefore only consume text on the same line as 2017 and never reaches the preceding lines:

library(stringr)

txt <- ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n2017**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n2018**   0.70   0"

(base_tre <- gsub(".*2017|2018.*", "", txt))
#> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
(base_perl <- gsub(".*2017|2018.*", "", txt, perl = TRUE))
#> [1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
(string_r <- str_replace_all(txt, ".*2017|2018.*", ""))
#> [1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

identical(base_perl, string_r)
#> [1] TRUE
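
The difference can be isolated with a much smaller input. The following is a minimal sketch, assuming a made-up two-line string x that is not part of the original question; it asks each engine whether a.b matches across the line break. Going by the results above, only the default TRE engine should report a match.

```r
library(stringr)

# Hypothetical two-line input: "a" and "b" separated by a line break.
x <- "a\nb"

tre_hit  <- grepl("a.b", x)               # base R default engine (TRE)
pcre_hit <- grepl("a.b", x, perl = TRUE)  # base R with PCRE
icu_hit  <- str_detect(x, "a.b")          # stringr (ICU)

# Collect the three answers side by side.
c(TRE = tre_hit, PCRE = pcre_hit, ICU = icu_hit)
```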

We can use the inline modifier (?s) to change the behavior of PCRE and ICU so that the dot . also matches line breaks. This produces the same output as base R with TRE:

(base_perl <- gsub("(?s).*2017|2018(?s).*", "", txt, perl = TRUE))
#> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

(string_r <- str_replace_all(txt, "(?s).*2017|2018(?s).*", ""))
#> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

identical(base_perl, string_r)
#> [1] TRUE
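
As an alternative to the inline (?s) modifier, stringr lets us set the same option through its regex() helper with dotall = TRUE, which may be easier to read in a lesson. This is a sketch on a shortened, made-up input (small_txt is an assumption, not the original data), so the pattern stays unchanged from the question.

```r
library(stringr)

# Hypothetical shortened input with the same layout as the original txt.
small_txt <- "2016 a\n2017 keep\n2018 b"

# Base R default (TRE): the dot already matches line breaks.
base_tre <- gsub(".*2017|2018.*", "", small_txt)

# stringr (ICU): switch on "dot matches newline" via regex(dotall = TRUE)
# instead of writing (?s) inside the pattern.
string_r <- str_replace_all(small_txt, regex(".*2017|2018.*", dotall = TRUE), "")

identical(base_tre, string_r)
```

With dotall = TRUE the pattern itself can stay exactly as it was written for gsub, which keeps the two calls directly comparable.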

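Finally, unlike TRE, PCRE and ICU support lookarounds, which offer another way to solve the problem: instead of deleting everything around the target line, we extract the text between 2017 and the line break before 2018. Note that the result does not include the trailing \n: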

str_match(txt, "(?<=2017).*.(?=\\n2018)")
#>      [,1]                                                                                    
#> [1,] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50"
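
str_match returns a character matrix. If a plain character vector is preferred, the same lookaround idea should also work with str_extract, or in base R with regmatches and regexpr using perl = TRUE. The sketch below uses a shortened, made-up input (small_txt is an assumption) and the slightly simplified pattern (?<=2017).*(?=\\n2018).

```r
library(stringr)

# Hypothetical shortened input with the same layout as the original txt.
small_txt <- "2016 a\n2017 keep\n2018 b"

# Lookbehind for "2017", lookahead for a line break followed by "2018".
pat <- "(?<=2017).*(?=\\n2018)"

# stringr (ICU): returns a character vector instead of a matrix.
icu_res <- str_extract(small_txt, pat)

# Base R: lookarounds require perl = TRUE (PCRE); TRE does not support them.
pcre_res <- regmatches(small_txt, regexpr(pat, small_txt, perl = TRUE))

identical(icu_res, pcre_res)
```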

Created on 2021-08-10 by the reprex package (v0.3.0)

TimTeaFan answered Sep 22 '22 02:09