I am looking for ways to speed up my code. I am looking into the apply/ply methods as well as data.table. Unfortunately, I am running into problems.
Here is a small sample data:
ids1   <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2   <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd  ", "__ee", NA,NA, "n/a")
chars2 <- c("vv", "_ ww_", "  xx  ", "yy__", "  zz", NA, "n/a", "n/a")
data   <- data.frame(col1 = ids1, col2 = ids2, 
                 col3 = chars1, col4 = chars2, 
          stringsAsFactors = FALSE)
Here is a solution using loops:
library("plyr")
cols_to_fix <- c("col3","col4")
for (i in 1:length(cols_to_fix)) {
  data[,cols_to_fix[i]] <- gsub("_", "", data[,cols_to_fix[i]])
  data[,cols_to_fix[i]] <- gsub(" ", "", data[,cols_to_fix[i]])
  data[,cols_to_fix[i]] <- ifelse(data[,cols_to_fix[i]]=="n/a", NA, data[,cols_to_fix[i]])
} 
I initially looked at ddply, but some methods I want to use only take vectors.  Hence, I cannot figure out how to do ddply across just certain columns one-by-one.
Also, I have been looking at laply, but I want to return the original data.frame with the changes. Can anyone help me? Thank you.
Based on the suggestions from earlier, here is what I tried to use from the plyr package.
Option 1:
data[,cols_to_fix] <- aaply(data[,cols_to_fix],2, function(x){
   x <- gsub("_", "", x,perl=TRUE)
   x <- gsub(" ", "", x,perl=TRUE)
   x <- ifelse(x=="n/a", NA, x)
},.progress = "text",.drop = FALSE)
Option 2:
data[,cols_to_fix] <- alply(data[,cols_to_fix],2, function(x){
   x <- gsub("_", "", x,perl=TRUE)
   x <- gsub(" ", "", x,perl=TRUE)
   x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
Option 3:
data[,cols_to_fix] <- adply(data[,cols_to_fix],2, function(x){
   x <- gsub("_", "", x,perl=TRUE)
   x <- gsub(" ", "", x,perl=TRUE)
   x <- ifelse(x=="n/a", NA, x)
},.progress = "text")
None of these are giving me the correct answer.
apply works great, but my data is very large and the progress bars from plyr package would be a very nice. Thanks again.
plyr is retired: this means only changes necessary to keep it on CRAN will be made. We recommend using dplyr (for data frames) or purrr (for lists) instead.
You can use the apply() function to apply a function to each row in a matrix or data frame in R. where: X: Name of the matrix or data frame. MARGIN: Dimension to perform operation across.
Here's a data.table solution using set. 
require(data.table)
DT <- data.table(data)
for (j in cols_to_fix) {
    set(DT, i=NULL, j=j, value=gsub("[ _]", "", DT[[j]], perl=TRUE))
    set(DT, i=which(DT[[j]] == "n/a"), j=j, value=NA_character_)
}
DT
#    col1 col2 col3 col4
# 1:    1    1   aa   vv
# 2:    1    2   bb   ww
# 3:    1    3   cc   xx
# 4:    1    4   dd   yy
# 5:    2    1   ee   zz
# 6:    2    2   NA   NA
# 7:    2    3   NA   NA
# 8:    2    4   NA   NA
First line reads: set in DT for all i(=NULL), and column=j the value gsub(..).
Second line reads: set in DT where i(=condn) and column=j with value NA_character_.
Note: Using PCRE (perl=TRUE) has nice speed-up, especially on bigger vectors. 
Here is a data.table solution, should be faster if your table is large.
The concept of := is an "update" of the columns. I believe that because of this you aren't copying the table internally again as a "normal" dataframe solution would.
require(data.table)
DT <- data.table(data)
fxn = function(col) {
  col = gsub("[ _]", "", col, perl = TRUE)
  col[which(col == "n/a")] <- NA_character_
  col
}
cols = c("col3", "col4");
# lapply your function
DT[, (cols) := lapply(.SD, fxn), .SDcols = cols]
print(DT)
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With