I want to replace all ,
, -
, )
, (
and (space) with
.
from the variable DMA.NAME in the example data frame. I referred to three posts and tried their approaches but all failed.:
Replacing column values in data frame, not included in list
R replace all particular values in a data frame
Replace characters from a column of a data frame R
Approach 1
> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."
Approach 2
> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)
Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
argument 'pattern' has length > 1 and only the first element will be used
Approach 3
> c[c == c(" ", ",", "(", ")", "-")] <- "."
Sample data frame
> df
DMA.CODE DATE DMA.NAME count
111 22 8/14/2014 12:00:00 AM Columbus, OH 1
112 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
79 18 7/30/2014 12:00:00 AM Boston (Manchester) 1
99 22 8/20/2014 12:00:00 AM Columbus, OH 1
112.1 23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn 1
208 27 7/31/2014 12:00:00 AM Minneapolis-St. Paul 1
I know the problem - gsub
uses pattern and only first element . The other two approaches are searching the entire variable for the exact value instead of searching within value for specific characters.
You can use the special groups [:punct:]
and [:space:]
inside of a pattern group ([...]
) like this:
df <- data.frame(
DMA.NAME = c(
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Boston (Manchester)",
"Columbus, OH",
"Orlando-Daytona Bch-Melbrn",
"Minneapolis-St. Paul"),
stringsAsFactors=F)
##
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME)
[1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
If your data frame is big you might want to look at this fast function from stringi
package. This function replaces every character of specific class for another. In this case character class is L
- letters (inside {}
), but big P
(before {}
) indicates that we are looking for the complements of this set, so for every non letter character. Merge indicates that consecutive matches should be merged into a single one.
require(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T)
## [1] "Columbus.OH" "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester." "Columbus.OH"
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
And some benchmarks:
x <- sample(df$DMA.NAME, 1000, T)
gsubFun <- function(x){
gsub("[[:punct:][:space:]]+","\\.",x)
}
striFun <- function(x){
stri_replace_all_charclass(x, "\\P{L}",".", T)
}
require(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
expr min lq median uq max neval
gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984 100
striFun(x) 877.259 893.3945 907.769 929.8065 3189.017 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With