I've got the following vector:
words <- c("5lang","kasverschil2","b2b")
I want to remove "5" in "5lang" and "2" in "kasverschil2". But I do NOT want to remove "2" in "b2b".
To remove a dot and number at the end of a string, we can use the gsub() function. It searches each element of the vector for the pattern (a dot followed by digits at the end of the string) and replaces every match with the empty string, written as "" (double quotes with nothing between them).
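As a minimal sketch of that idea, the pattern below assumes "dot and number" means a literal dot followed by one or more digits. The sample vector `versions` is hypothetical and is not taken from the original post.

```r
# Hypothetical sample data (not from the original post)
versions <- c("file.1", "report.23", "v1.5x", "plain")

# "\\.\\d+$": a literal dot, then one or more digits, anchored at the end
trimmed <- gsub("\\.\\d+$", "", versions)
print(trimmed)
#[1] "file"   "report" "v1.5x"  "plain"
```

"v1.5x" is left alone because its digits are not at the very end of the string.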
Answer: Use [^[:alnum:]] to remove punctuation and symbols such as ~!@#$%^&*(){}_+:"<>?,./;'[]-= and use [^a-zA-Z0-9] to also remove accented letters such as â í ü Â á ą ę ś ć. Both patterns work in regex functions such as gsub() or regexpr().
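A rough illustration of the two character classes follows. Both sample strings are hypothetical. How [[:alnum:]] treats accented letters can depend on the locale, so no output is shown for those two calls.

```r
# Hypothetical sample strings (not from the original post)
ascii_only <- "b2b~!@#lang"
accented   <- "caf\u00e9 2 go"

# Drop every character that is not a letter or a digit
cleaned <- gsub("[^[:alnum:]]", "", ascii_only)
print(cleaned)
#[1] "b2blang"

# [[:alnum:]] may or may not count the accented letter, depending on locale
print(gsub("[^[:alnum:]]", "", accented))

# The explicit ASCII ranges list only unaccented letters and digits
print(gsub("[^a-zA-Z0-9]", "", accented))
```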
To remove a string's first character, we can use the built-in substring() function in R. substring() accepts three arguments: the string, the start position, and the end position (which is optional).
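Applied to the vector from the question, a sketch of the substring() approach looks like this. It drops the first character whether or not it is a digit, so it does not solve the original problem by itself.

```r
words <- c("5lang", "kasverschil2", "b2b")

# Start at position 2; the end position is left at its (very large) default
dropped <- substring(words, 2)
print(dropped)
#[1] "lang"        "asverschil2" "2b"
```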
gsub("^\\d+|\\d+$", "", words)
#[1] "lang" "kasverschil" "b2b"
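The anchors ^ and $ in the pattern above are what protect the digit inside "b2b". A short comparison with an unanchored pattern, as a sketch:

```r
words <- c("5lang", "kasverschil2", "b2b")

# Without anchors, every run of digits is removed, including the one in "b2b"
unanchored <- gsub("\\d+", "", words)
print(unanchored)
#[1] "lang"        "kasverschil" "bb"

# With anchors, only digits at the very start or very end are removed
anchored <- gsub("^\\d+|\\d+$", "", words)
print(anchored)
#[1] "lang"        "kasverschil" "b2b"
```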
Another option would be to use stringi
library(stringi)
stri_replace_all_regex(words, "^\\d+|\\d+$", "")
#[1] "lang" "kasverschil" "b2b"
Using a variant of the data set provided by the OP, here are benchmarks for the three main solutions (note that these strings are very short and contrived; results may differ on a larger, real data set):
words <- rep(c("5lang","kasverschil2","b2b"), 100000)
library(stringi)
library(microbenchmark)
GSUB <- function() gsub("^\\d+|\\d+$", "", words)
STRINGI <- function() stri_replace_all_regex(words, "^\\d+|\\d+$", "")
GREGEXPR <- function() {
  # Locate leading/trailing digit runs, then keep everything that did not match
  mm <- gregexpr(pattern = '(^[0-9]+|[0-9]+$)', text = words)
  sapply(regmatches(words, mm, invert = TRUE), paste, collapse = "")
}
microbenchmark(
  GSUB(),
  STRINGI(),
  GREGEXPR(),
  times = 100L
)
## Unit: milliseconds
## expr min lq median uq max neval
## GSUB() 301.0988 349.9952 396.3647 431.6493 632.7568 100
## STRINGI() 465.9099 513.1570 569.1972 629.4176 738.4414 100
## GREGEXPR() 5073.1960 5706.8160 6194.1070 6742.1552 7647.8904 100