I am writing a general function for missing value treatment. The data can have character, numeric, factor and integer columns. An example of the data is as follows:
library(data.table)

dt <- data.table(
  num1  = c(1, 2, 3, 4, NA, 5, NA, 6),
  num3  = c(1, 2, 3, 4, 5, 6, 7, 8),
  int1  = as.integer(c(NA, NA, 102, 105, NA, 300, 400, 700)),
  int3  = as.integer(c(1, 10, 102, 105, 200, 300, 400, 700)),
  cha1  = c('a', 'b', 'c', NA, NA, 'c', 'd', 'e'),
  cha3  = c('xcda', 'b', 'c', 'miss', 'no', 'c', 'dfg', 'e'),
  fact1 = c('a', 'b', 'c', NA, NA, 'c', 'd', 'e'),
  fact3 = c('ad', 'bd', 'cc', 'zz', 'yy', 'cc', 'dd', 'ed'),
  allm  = as.integer(c(NA, NA, NA, NA, NA, NA, NA, NA)),
  miss  = as.character(c("", "", 'c', 'miss', 'no', 'c', 'dfg', 'e')),
  miss2 = as.integer(c('', '', 3, 4, 5, 6, 7, 8)),
  miss3 = as.factor(c(".", ".", ".", "c", "d", "e", "f", "g")),
  miss4 = as.factor(c(NA, NA, '.', '.', '', '', 't1', 't2')),
  miss5 = as.character(c(NA, NA, '.', '.', '', '', 't1', 't2'))
)
I was using this code to flag missing values:
dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]
But it turns out to be very slow, and I also have to add logic that treats "." as missing. So I am planning to use this for missing value identification:
dt[miss5 %in% c(NA,'','.'),flag:=1]
but on a 6 million row data set it takes close to 1 second to run, whereas
dt[!nzchar(miss5), flag := 1]
takes close to 0.14 seconds to run.
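For reference, here is roughly how I am timing the two versions (a minimal sketch on simulated data; the x vector here just stands in for my real 6-million-row character column):
# simulate a 6-million-element character column containing NA, "" and "." values
set.seed(1)
x <- sample(c(NA, "", ".", "t1", "t2"), 6e6, replace = TRUE)
d <- data.table(miss5 = x)
system.time(d[miss5 %in% c(NA, "", "."), flag := 1])  # %in% version (catches NA, "" and ".")
system.time(d[!nzchar(miss5), flag2 := 1])            # nzchar version (faster, but misses NA and ".")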
My question is: can we have code that takes as little time as possible while treating NA, blank and dot (NA, "", ".") as missing?
Any help is highly appreciated.
== and %in% are optimised to use binary search automatically (NEW FEATURE: auto indexing). To use it, we have to ensure that:
a) we use dt[...] instead of set(), as auto indexing is not yet implemented in set(), #1196.
b) when the RHS of %in% is of a higher SEXPTYPE than the LHS, auto indexing re-routes to base R to ensure correct results (as binary search always coerces the RHS). So for integer columns we need to make sure we pass in just NA, and not "." or "".
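To illustrate point (b) on the question's example data (a small sketch; the flag_cha1 and flag_int1 column names are just for illustration):
# character LHS: "", "." and NA can all go in the RHS, auto indexing still applies
dt[cha1 %in% c("", ".", NA), flag_cha1 := 1L]
# integer LHS: pass only NA, so the RHS is not a higher SEXPTYPE
# and the fast binary-search path is kept
dt[int1 %in% NA, flag_int1 := 1L]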
Using @akrun's data, here's the code and run time:
in_col  = grep("^miss", names(dt), value=TRUE)   # columns to scan
out_col = gsub("^miss", "flag", in_col)          # corresponding flag columns
system.time({
  dt[, (out_col) := 0L]                          # initialise all flags to 0
  for (j in seq_along(in_col)) {
    # pick the lookup values based on the column type
    if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
      lookup = c("", ".", NA)
    } else lookup = NA
    # build the i-expression  <col> %in% lookup  and evaluate it
    expr = call("%in%", as.name(in_col[j]), lookup)
    tt = dt[eval(expr), (out_col[j]) := 1L]
  }
})
# user system elapsed
# 1.174 0.295 1.476
How it works:
a) We first initialise all output columns to 0.
b) Then, for each column, we check its type and create the lookup accordingly.
c) We then create the corresponding expression for i: miss(.) %in% lookup.
d) Then we evaluate the expression in i, which will use auto indexing to create an index very quickly, and use that index to find the matching rows with binary search.
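For example, the constructed i-expression for one column looks like this (a quick illustration using the miss5 column; the printed form is shown as a comment):
expr <- call("%in%", as.name("miss5"), c("", ".", NA))
expr
# miss5 %in% c("", ".", NA)
dt[eval(expr)]   # rows where miss5 is "", "." or NA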
Note: if necessary, you can add a set2key(dt, NULL) at the end of the for-loop so that the created indices are removed immediately after use (to save space).
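That is, the loop from the timing above would end like this (same loop, with the cleanup call added; set2key() is the index-removal helper named in this answer for the data.table version it targets):
for (j in seq_along(in_col)) {
  if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
    lookup = c("", ".", NA)
  } else lookup = NA
  expr = call("%in%", as.name(in_col[j]), lookup)
  dt[eval(expr), (out_col[j]) := 1L]
  set2key(dt, NULL)   # drop the index created for this column right away
}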
Compared to this run, @akrun's fastest answer takes 6.33 seconds, which is a ~4.2x speedup.
Update: On 4 million rows and 100 columns, it takes ~ 9.2 seconds. That's ~0.092 seconds per column.
Calling [.data.table a 100 times could be expensive. When auto indexing is implemented in set(), it'd be nice to compare the performance.
You can loop through the 'miss' columns and create the corresponding 'flag' columns with set().
library(data.table) # v1.9.5+
ind <- grep('^miss', names(dt))              # positions of the 'miss' columns
nm1 <- sub('miss', 'flag', names(dt)[ind])   # matching 'flag' column names
dt[, (nm1) := 0L]                            # initialise the flags to 0
for(j in seq_along(ind)){
  set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
}
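A quick check on the question's example data (the flag names follow the sub('miss', 'flag', ...) mapping, so miss5 pairs with flag5):
# rows 1-6 of miss5 are NA, "." or "" and get flag5 = 1; rows 7-8 ("t1", "t2") get 0
dt[, .(miss5, flag5)]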
# benchmark data: 6 million rows, five integer and five character 'miss' columns
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA, 0:9), 6e6*5, replace=TRUE), ncol=5))
set.seed(23)
df2 <- as.data.frame(matrix(sample(c('.', '', letters[1:5]), 6e6*5,
          replace=TRUE), ncol=5))
set.seed(234)
i1 <- sample(10)                                        # shuffle the column order
dfN <- setNames(cbind(df1, df2)[i1], paste0('miss', 1:10))
dt <- as.data.table(dfN)
system.time({
  ind <- grep('^miss', names(dt))
  nm1 <- sub('miss', 'flag', names(dt)[ind])
  dt[, (nm1) := 0L]
  for(j in seq_along(ind)){
    set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
  }
})
#user system elapsed
# 8.352 0.150 8.496
# alternative: build a 0/1 matrix and cbind it (this run assumes a fresh `dt`
# holding only the 10 'miss' columns, since m1 has exactly 10 columns)
system.time({
  m1 <- matrix(0, nrow=6e6, ncol=10)
  m2 <- sapply(seq_along(dt), function(i) {
    ind <- which(dt[[i]] %in% c('.', '', NA))
    replace(m1[, i], ind, 1L)
  })
  cbind(dt, m2)
})
#user system elapsed
# 14.227 0.362 14.582