
Fast way to replace all blanks with NA in R data.table

Tags: r, na, data.table

I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in this post, but it is extremely slow on my data table (it has already been running for over 15 minutes). Example from the other post:

 data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
                   dogs=rep(c("woof", " ", NA),1e6))
 system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))

Is there a faster, more data.table-style way to achieve this?

Admittedly, the example data above does not look much like my original data; it was only meant as an illustration. The following subset of my real data gives the charToDate(x) error:

DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)
Tim_Utrecht asked Jul 20 '15 12:07



2 Answers

Here is probably the generic data.table way of doing this. I'm also going to use your regex, which handles several types of blanks (I haven't seen other answers doing this). You probably shouldn't run this over all your columns, but only over the factor or character ones, because columns of other classes cannot contain blank strings.

For factors

indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_) 

For characters

indx2 <- which(sapply(data, is.character)) 
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
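
To illustrate how restricting the loop to character columns plays out on the small DT from the question, here is a minimal sketch. The helper name blank_to_na and its pattern argument are hypothetical (they are not part of data.table or of the answer above); the default pattern is simply the regex from the question. Because the Date column is never compared against '', the comparison that triggers the charToDate(x) error should not occur.

```r
library(data.table)

# Hypothetical helper (an illustration, not part of data.table):
# replace blank strings with NA in character columns only, by reference.
blank_to_na <- function(dt, pattern = "^$|^ $") {
  # Pick out character columns; Date, numeric, etc. are left untouched
  chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in chr_cols) {
    # Row indices whose value matches the blank pattern
    rows <- grep(pattern, dt[[col]])
    if (length(rows)) {
      set(dt, i = rows, j = col, value = NA_character_)
    }
  }
  invisible(dt)
}

# The subset from the question: one numeric, one Date, one character column
DT <- data.table(ID = c(10), DEFAULT_DATE = as.Date("2012-07-31"), value = '')
blank_to_na(DT)

# A second small table mixing blanks, a single space, and real values
DT2 <- data.table(cats = c('', ' ', 'meow'), n = 1:3)
blank_to_na(DT2)
```

Since set() modifies the table in place, no reassignment is needed after calling the helper. Factor columns would still need the separate NA_integer_ loop shown above.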
David Arenburg answered Nov 01 '22 19:11


Use this approach:

system.time(data[data==''|data==' ']<-NA)
  user  system elapsed 
  1.47    0.19    1.66 

system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
  user  system elapsed 
  3.41    0.20    3.64
Colonel Beauvel answered Nov 01 '22 18:11