
How to speed up a missing-value search in an R data.table

Tags: r, data.table

I am writing a general function for missing value treatment. The data can have character, numeric, factor and integer columns. An example of the data is as follows:

library(data.table)
dt<-data.table(
  num1=c(1,2,3,4,NA,5,NA,6),
  num3=c(1,2,3,4,5,6,7,8),
  int1=as.integer(c(NA,NA,102,105,NA,300,400,700)),
  int3=as.integer(c(1,10,102,105,200,300,400,700)),
  cha1=c('a','b','c',NA,NA,'c','d','e'),
  cha3=c('xcda','b','c','miss','no','c','dfg','e'),
  fact1=c('a','b','c',NA,NA,'c','d','e'),
  fact3=c('ad','bd','cc','zz','yy','cc','dd','ed'),
  allm=as.integer(c(NA,NA,NA,NA,NA,NA,NA,NA)),
  miss=as.character(c("","",'c','miss','no','c','dfg','e')),
  miss2=as.integer(c('','',3,4,5,6,7,8)),
  miss3=as.factor(c(".",".",".","c","d","e","f","g")),
  miss4=as.factor(c(NA,NA,'.','.','','','t1','t2')),
  miss5=as.character(c(NA,NA,'.','.','','','t1','t2'))  
)

I was using this code to flag missing values:

dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]

But it turns out to be very slow, and additionally I have to add logic that also treats "." as missing. So I am planning to use this for missing value identification:

dt[miss5 %in% c(NA,'','.'),flag:=1]

but on a 6 million row data set it takes close to 1 second to run, whereas

dt[!nzchar(miss5),flag:=1]  takes close to 0.14 seconds to run.

My question is: can we write code that takes as little time as possible while treating NA, blank ("") and dot (".") as missing?
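For reference, a single combined condition is one option (a sketch only, assuming miss5 is a character column; %chin% is data.table's fast character-matching operator):

# Flag rows where miss5 is NA, empty, or "."
dt[, flag := as.integer(is.na(miss5) | miss5 %chin% c("", "."))]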

Any help is highly appreciated.

asked Jun 23 '15 by Anuj




2 Answers

== and %in% are optimised to use binary search automatically (NEW FEATURE: Auto indexing). To use it, we have to ensure that:

a) we use dt[...] instead of set(), as auto indexing is not yet implemented in set() (#1196).

b) When the RHS of %in% is of a higher SEXPTYPE than the LHS, auto indexing re-routes to base R to ensure correct results (since binary search always coerces the RHS). So for integer columns we need to make sure we pass in just NA, and not "." or "" (see the sketch below).
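To illustrate point (b) on the question's data (a sketch only; flag_int1 is a hypothetical column name, not part of the original code):

# With an integer column, keep the RHS integer (just NA) so auto indexing
# can build an index and use binary search:
dt[int1 %in% NA_integer_, flag_int1 := 1L]
# A character RHS such as c(NA, ".", "") is a higher SEXPTYPE, so auto indexing
# would re-route the query to base R's %in% instead:
# dt[int1 %in% c(NA, ".", ""), flag_int1 := 1L]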

Using @akrun's data, here's the code and run time:

in_col  = grep("^miss", names(dt), value=TRUE)   # columns to check
out_col = gsub("^miss", "flag", in_col)          # corresponding flag columns
system.time({
    dt[, (out_col) := 0L]                        # initialise all flag columns to 0
    for (j in seq_along(in_col)) {
        # "." and "" are only meaningful for character/factor columns
        if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
            lookup = c("", ".", NA)
        } else lookup = NA
        expr = call("%in%", as.name(in_col[j]), lookup)  # build: miss<j> %in% lookup
        tt = dt[eval(expr), (out_col[j]) := 1L]          # auto indexing + binary search
    }
})
#    user  system elapsed 
#   1.174   0.295   1.476 

How it works:

a) We first initialise all output columns to 0.

b) Then, for each column, we check its type and create the lookup accordingly.

c) We then create the corresponding expression for i, i.e. miss(.) %in% lookup (see the sketch below).

d) Then we evaluate the expression in i, which will use auto indexing to create an index very quickly, and then use that index to find the matching rows with binary search.
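For one character column, the constructed call is equivalent to writing the subset by hand (an illustration only, using the miss5 column and its derived flag column flag5 from the code above):

expr <- call("%in%", as.name("miss5"), c("", ".", NA))
expr                          # prints: miss5 %in% c("", ".", NA)
dt[eval(expr), flag5 := 1L]   # same as dt[miss5 %in% c("", ".", NA), flag5 := 1L]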

Note: If necessary, you can add a set2key(dt, NULL) at the end of the for loop so that the created indices are removed immediately after use (to save space).

Compared to this run, @akrun's fastest answer takes 6.33 seconds, so this is a ~4.2x speedup.

Update: On 4 million rows and 100 columns, it takes ~9.2 seconds, i.e. ~0.092 seconds per column.

Calling [.data.table 100 times could be expensive. When auto indexing is implemented in set(), it would be nice to compare the performance.

answered by Arun


You can loop through the 'miss' columns and create the corresponding 'flag' columns with set().

library(data.table) # v1.9.5+
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag', names(dt)[ind])
dt[, (nm1) := 0]
for(j in seq_along(ind)){
  set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
}
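On the example data from the question, the result can be checked along these lines (a sketch; flag5 is the column derived from miss5 by the sub('miss', 'flag', ...) pattern above):

# flag5 should be 1 exactly where miss5 is NA, "" or "."
dt[, .(miss5, flag5)]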

Benchmarks

set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA,0:9), 6e6*5, replace=TRUE), ncol=5))
set.seed(23)
df2 <- as.data.frame(matrix(sample(c('.','', letters[1:5]), 6e6*5,
   replace=TRUE), ncol=5))
set.seed(234)
i1 <- sample(10)
dfN <- setNames(cbind(df1, df2)[i1], paste0('miss',1:10))
dt <- as.data.table(dfN)

system.time({
 ind <- grep('^miss', names(dt))
 nm1 <- sub('miss', 'flag',names(dt)[ind])
 dt[,(nm1) := 0L]
 for(j in seq_along(ind)){
  set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)), j= nm1[j], value=1L)
  }
 }
)
#user  system elapsed 
#  8.352   0.150   8.496 

system.time({
  m1 <- matrix(0, nrow=6e6, ncol=10)
  m2 <- sapply(seq_along(dt), function(i) {
    ind <- which(dt[[i]] %in% c('.', '', NA))
    replace(m1[, i], ind, 1L)})
  cbind(dt, m2)})
#   user  system elapsed 
# 14.227   0.362  14.582   

answered by akrun