Different behavior of base R gsub and stringr::str_replace_all?


I would expect gsub and stringr::str_replace_all to return the same result in the following example, but only gsub returns the intended result. I am developing a lesson to demonstrate str_replace_all, so I would like to know why it returns a different result here.

txt <- ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n2017**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n2018**   0.70   0"

gsub(".*2017|2018.*", "", txt)

stringr::str_replace_all(txt, ".*2017|2018.*", "")

gsub returns the intended output (everything before and including 2017, and after and including 2018, has been removed).

output of gsub (intended)

[1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

However, str_replace_all removes only the 2017 itself and the line starting with 2018, leaving all earlier lines in place, even though the same pattern is used for both.

output of str_replace_all (not intended)

[1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

Why is this the case?


asked Jun 19 '20 13:06

qdread

1 Answer

Base R can use two regex libraries. By default it uses TRE; specifying perl = TRUE switches to PCRE (Perl-compatible regular expressions). The {stringr} package uses ICU (Java-like regular expressions).

In your case the problem is that the dot . doesn’t match line breaks in PCRE and ICU, while it does match line breaks in TRE:

library(stringr)

txt <- ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n2017**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n2018**   0.70   0"

(base_tre <- gsub(".*2017|2018.*", "", txt))
#> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
(base_perl <- gsub(".*2017|2018.*", "", txt, perl = TRUE))
#> [1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"
(string_r <- str_replace_all(txt, ".*2017|2018.*", ""))
#> [1] ".72   2.51\n2015**   2.45   2.30   2.00   1.44   1.20   1.54   1.84   1.56   1.94   1.47   0.86   1.01\n2016**   1.53   1.75   2.40   2.62   2.35   2.03   1.25   0.52   0.45   0.56   1.88   1.17\n**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

identical(base_perl, string_r)
#> [1] TRUE
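The difference can be isolated with a much smaller input. The following is a minimal sketch, not part of the original reprex; the string x is a made-up example that contains a single line break, and it is tested against the pattern a.b with each of the three engines:

```r
library(stringr)

# Hypothetical two-line input: "a", a line break, then "b"
x <- "a\nb"

# TRE (base R default): the dot matches the line break
tre <- grepl("a.b", x)

# PCRE (perl = TRUE): the dot does not match the line break
pcre <- grepl("a.b", x, perl = TRUE)

# ICU (stringr): the dot does not match the line break either
icu <- str_detect(x, "a.b")

c(tre = tre, pcre = pcre, icu = icu)
#>   tre  pcre   icu 
#>  TRUE FALSE FALSE
```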

We can use modifiers to change the behavior of the PCRE and ICU engines so that the dot (.) also matches line breaks. The (?s) modifier does this and produces the same output as base R with TRE:

(base_perl <- gsub("(?s).*2017|2018(?s).*", "", txt, perl = TRUE))
#> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

(string_r <- str_replace_all(txt, "(?s).*2017|2018(?s).*", ""))
#> [1] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50\n"

identical(base_perl, string_r)
#> [1] TRUE
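In stringr the same switch can also be set outside the pattern, through the dotall argument of regex(). The sketch below assumes a shortened stand-in for txt (the values are illustrative only, not the original data) so that the result is easy to check by eye:

```r
library(stringr)

# Shortened, illustrative stand-in for the question's txt
txt_small <- "2016** 1.53\n2017** 0.77 0.70\n2018** 0.70"

# dotall = TRUE lets the dot match line breaks, like (?s) inside the pattern
icu_dotall <- str_replace_all(
  txt_small,
  regex(".*2017|2018.*", dotall = TRUE),
  ""
)
icu_dotall
#> [1] "** 0.77 0.70\n"

# Same result as the base R default (TRE)
identical(icu_dotall, gsub(".*2017|2018.*", "", txt_small))
#> [1] TRUE
```

Keeping the flag in regex() leaves the pattern string identical to the one passed to gsub, which may be easier to explain in a lesson than an inline modifier.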

Finally, unlike TRE, PCRE and ICU support lookarounds, which offer another way to solve the problem:

str_match(txt, "(?<=2017).*.(?=\\n2018)")
#>      [,1]                                                                                    
#> [1,] "**   0.77   0.70   0.74   1.12   0.88   0.79   0.10   0.09   0.32   0.05   0.15   0.50"

Created on 2021-08-10 by the reprex package (v0.3.0)
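Because the lookaround pattern has no capture groups, str_extract() is a possible alternative to str_match() when a plain character vector is preferred over a matrix. The sketch below is an addition to the answer, again using a shortened, illustrative stand-in for txt; it also shows a base R equivalent with perl = TRUE:

```r
library(stringr)

# Shortened, illustrative stand-in for the question's txt
txt_small <- "2016** 1.53\n2017** 0.77 0.70\n2018** 0.70"

# ICU: keep what lies between "2017" and the line break before "2018"
icu_look <- str_extract(txt_small, "(?<=2017).*(?=\\n2018)")
icu_look
#> [1] "** 0.77 0.70"

# PCRE: the same lookarounds work in base R once perl = TRUE is set
pcre_look <- regmatches(
  txt_small,
  regexpr("(?<=2017).*(?=\\n2018)", txt_small, perl = TRUE)
)
identical(icu_look, pcre_look)
#> [1] TRUE
```

Unlike the replacement approach, this returns the line without its trailing line break, since the lookahead only checks for \n and does not include it in the match.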


answered Sep 22 '22 02:09

TimTeaFan

Tags: regex, r, string-substitution, stringr
