
Delete last two characters in string if they match criteria

I have 2 million names in a database. For example:

df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE A", "A SCHWARZENEGGER"))

> df
             names
1           A ADAM
2           S BEAN
3        A APPLE A
4 A SCHWARZENEGGER

I want to delete ' A' (white space A) if these are the last two characters of the string.

I know that regex is our friend here. How do I efficiently apply a regex function to the last two characters of the string?

Desired output:

> output
             names
1           A ADAM
2           S BEAN
3          A APPLE
4 A SCHWARZENEGGER
wake_wake asked Dec 04 '22 23:12



1 Answer

If you need good performance on millions of records, use the stringi package. In the benchmark below, every stringi variant is faster than base R's sub():

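library(stringi)

# Test data: 10,000 random strings with lengths from 1 to 100;
# about 1% of them get " A" appended
n <- 10000
x <- stri_rand_strings(n, 1:100)
ind <- sample(n, n/100)
x[ind] <- stri_paste(x[ind], " A")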

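# Base R: regex substitution anchored at the end of the string
baseR <- function(x){
  sub("\\sA$", "", x)
}

# stringi: the same regex replacement
stri1 <- function(x){
  stri_replace_last_regex(x, "\\sA$", "")
}

# stringi: detect matches by regex, then cut the last two characters
stri2 <- function(x){
  ind <- stri_detect_regex(x, "\\sA$")
  x[ind] <- stri_sub(x[ind], 1, -3)
  x
}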

# If the separator is always a plain space (not any whitespace character),
# a fixed-string check is faster still (roughly 200x faster than base R
# in the benchmark below)
stri3 <- function(x){
  ind <- stri_endswith_fixed(x, " A")
  x[ind] <- stri_sub(x[ind],1, -3)
  x
}


head(stri2(x), 44)

library(microbenchmark)
microbenchmark(baseR(x), stri1(x), stri2(x), stri3(x))
Unit: microseconds
     expr        min        lq        mean      median         uq        max neval
 baseR(x) 166044.032 172054.30 183919.6684 183112.1765 194586.231 219207.905   100
 stri1(x)  36704.180  39015.59  41836.8612  40164.9365  43773.034  60373.866   100
 stri2(x)  17736.535  18884.56  20575.3306  19818.2895  21759.489  31846.582   100
 stri3(x)    491.963    802.27    918.1626    868.9935   1008.776   2489.923   100
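
To tie the benchmark back to the question, here is a minimal sketch that applies the fixed-string approach (stri3) to the example data frame. The helper name strip_trailing_A is made up for illustration, and the NA guard is an added assumption: the benchmark data above contains no missing values, but a real table of names might, and a logical index containing NA cannot be used in this kind of subscripted assignment.

```r
library(stringi)

# Example data from the question; keep names as character, not factor
df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE A", "A SCHWARZENEGGER"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: drop a trailing " A" (literal space followed by A)
strip_trailing_A <- function(x) {
  ind <- stri_endswith_fixed(x, " A")
  ind[is.na(ind)] <- FALSE            # assumption: leave NA names untouched
  x[ind] <- stri_sub(x[ind], 1, -3)   # keep everything up to the third-from-last character
  x
}

output <- df
output$names <- strip_trailing_A(df$names)
print(output)
```

If the separator could be a tab or another whitespace character, the regex variants (stri1 or stri2) are the safer choice, at the cost of some speed.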
bartektartanus answered Dec 11 '22 16:12
