I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in another post, but it is extremely slow on my data table (it has already been running for over 15 minutes). Example from the other post:
data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
dogs=rep(c("woof", " ", NA),1e6))
system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
Is there a faster, more data.table-idiomatic way to achieve this?
Indeed, the provided data does not look much like the original data; it was just meant as an example. The following subset of my real data gives the charToDate(x) error:
DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)
Replace Empty String with NA in an R Dataframe
Use df[df == ''] to select the values of a data frame that are empty strings; assigning NA to that selection replaces every blank string value in all columns with NA.
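A minimal sketch of the df[df == ''] idiom, assuming a small made-up data frame with character columns only (the column names are borrowed from the question, the values are illustrative):

```r
# Toy data frame with character columns only (illustrative values)
df <- data.frame(
  cats = c("", "meow", "purr"),
  dogs = c("woof", "", "bark"),
  stringsAsFactors = FALSE
)

# df == "" is a logical matrix marking the empty strings;
# assigning NA through it replaces exactly those cells
df[df == ""] <- NA

print(df)
```

This compares every column against a string, so it is best suited to tables whose columns are all character; with a Date column the comparison itself can fail, as in the question.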
Here's probably the generic data.table way of doing this. I'm also going to use your regex, which handles several types of blanks (I haven't seen other answers doing this). You probably shouldn't run this over all your columns, but rather only over the factor or character ones, because other classes won't accept blank values.
For factors
indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_)
For characters
indx2 <- which(sapply(data, is.character))
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
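As a rough end-to-end sketch of the two loops above, assuming a toy data.table modelled on the DT from the question (the grp column and the extra rows are made up for illustration). Because only character and factor columns are selected, the Date column is never compared against a string:

```r
library(data.table)

# Hypothetical toy table: Date, character and factor columns side by side
DT <- data.table(
  ID           = c(10, 11, 12),
  DEFAULT_DATE = as.Date(c("2012-07-31", "2012-08-01", "2012-08-02")),
  value        = c("", " ", "meow"),
  grp          = factor(c("a", "", "b"))
)

blank <- "^$|^ $"  # same regex as in the question: empty or single space

# Character columns: overwrite matching rows with NA_character_ by reference
chr_cols <- which(sapply(DT, is.character))
for (j in chr_cols) {
  set(DT, i = grep(blank, DT[[j]]), j = j, value = NA_character_)
}

# Factor columns: overwrite matching rows with NA_integer_ by reference
fct_cols <- which(sapply(DT, is.factor))
for (j in fct_cols) {
  set(DT, i = grep(blank, DT[[j]]), j = j, value = NA_integer_)
}

print(DT)
```

Note that for factor columns the blank level itself stays in levels(DT$grp) even though no row uses it any more; droplevels() can remove it if that matters.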
Use this approach:
system.time(data[data==''|data==' ']<-NA)
user system elapsed
1.47 0.19 1.66
system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
user system elapsed
3.41 0.20 3.64