
Replace specific characters in a variable in data frame in R

I want to replace every occurrence of the characters , - ) ( and space with . in the variable DMA.NAME of the example data frame. I looked at three posts and tried their approaches, but all of them failed:

Replacing column values in data frame, not included in list

R replace all particular values in a data frame

Replace characters from a column of a data frame R

Approach 1

> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."

Approach 2

> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)

Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
  argument 'pattern' has length > 1 and only the first element will be used

Approach 3

> c[c == c(" ", ",", "(", ")", "-")] <- "."

Sample data frame

> df
DMA.CODE                  DATE                   DMA.NAME       count
111         22 8/14/2014 12:00:00 AM               Columbus, OH     1
112         23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
79          18 7/30/2014 12:00:00 AM        Boston (Manchester)     1
99          22 8/20/2014 12:00:00 AM               Columbus, OH     1
112.1       23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
208         27 7/31/2014 12:00:00 AM       Minneapolis-St. Paul     1

I know the problem: gsub takes a single pattern, so only the first element of the vector is used. The other two approaches compare each whole value against the listed characters instead of searching within each value for those characters.
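The difference between matching whole values and matching within values can be seen in a minimal sketch. The vector x below is illustrative, with values copied from the sample data frame, and grepl with a bracket expression is used only to show the contrast:

```r
# Illustrative values taken from the sample data frame
x <- c("Columbus, OH", "Boston (Manchester)")

# %in% compares each whole string against the set, so nothing matches
whole_match <- x %in% c("-", ",", " ", "(", ")")

# A bracket expression is searched for inside each string instead
inside_match <- grepl("[-,() ]", x)

print(whole_match)
print(inside_match)
```

The first result is all FALSE and the second all TRUE, which is why Approaches 1 and 3 leave the column unchanged.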

asked Mar 19 '23 03:03 by vagabond


2 Answers

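You can use the POSIX character classes [:punct:] and [:space:] inside a bracket expression ([...]). Note that [:punct:] covers all punctuation, including the period, and that the trailing + collapses each run of matching characters into a single replacement: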

df <- data.frame(
  DMA.NAME = c(
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Boston (Manchester)",
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Minneapolis-St. Paul"),
  stringsAsFactors = FALSE)

> gsub("[[:punct:][:space:]]+", ".", df$DMA.NAME)
[1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"               
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
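If only the five characters from the question should be replaced, a narrower bracket expression is one option. This is a minimal sketch, assuming the character vector from Approach 2 is reused to build the pattern; the names chars, pattern, x and result are illustrative and not part of the original answer:

```r
# The five characters from the question; the hyphen is listed first so that
# it is treated literally inside the bracket expression
chars <- c("-", ",", " ", "(", ")")

# Collapse the vector into a single pattern: "[-, ()]+"
pattern <- paste0("[", paste(chars, collapse = ""), "]+")

x <- c("Columbus, OH", "Boston (Manchester)", "Minneapolis-St. Paul")
result <- gsub(pattern, ".", x)
print(result)
# [1] "Columbus.OH"          "Boston.Manchester."   "Minneapolis.St..Paul"
```

Because the period is not in this set, the existing "." in "St." is kept and the following space becomes a second period, unlike with [:punct:].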
answered Mar 20 '23 16:03 by nrussell


If your data frame is big, you might want to look at this fast function from the stringi package. It replaces every character of a given class with a replacement string. Here the class is L (letters, written inside {}), and the capital P before the braces selects the complement of that set, that is, every non-letter character. Setting merge to TRUE makes consecutive matches collapse into a single replacement.

library(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}", ".", merge = TRUE)
## [1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"               
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"   
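A non-letter class is broader than punctuation plus whitespace, since it also covers digits. The sketch below shows the difference in base R, using [^[:alpha:]] as an approximate stand-in for \P{L} (an assumption that holds for ASCII input like this); the string x is made up for illustration:

```r
# Hypothetical input containing digits
x <- "Route 66, AZ"

# Punctuation and whitespace only: the digits survive
punct_space <- gsub("[[:punct:][:space:]]+", ".", x)

# Everything that is not a letter: the digits are replaced as well
non_letter <- gsub("[^[:alpha:]]+", ".", x)

print(punct_space)  # "Route.66.AZ"
print(non_letter)   # "Route.AZ"
```

For the sample DMA.NAME values there are no digits, so both patterns give the same result.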

And some benchmarks:

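# Build a larger test vector by resampling the six names with replacement
x <- sample(df$DMA.NAME, 1000, replace = TRUE)

# Base R version: collapse runs of punctuation/whitespace into one period
gsubFun <- function(x) {
    gsub("[[:punct:][:space:]]+", ".", x)
}

# stringi version: collapse runs of non-letter characters into one period
striFun <- function(x) {
    stri_replace_all_charclass(x, "\\P{L}", ".", merge = TRUE)
}

library(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))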
Unit: microseconds
       expr      min        lq   median        uq       max neval
 gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984   100
 striFun(x)  877.259  893.3945  907.769  929.8065  3189.017   100
answered Mar 20 '23 16:03 by bartektartanus