
How to speed up a missing-value search in an R data.table

Tags: r, data.table

I am writing a general function for missing value treatment. The data can have character, numeric, factor and integer columns. An example of the data is as follows:

library(data.table)
dt<-data.table(
  num1=c(1,2,3,4,NA,5,NA,6),
  num3=c(1,2,3,4,5,6,7,8),
  int1=as.integer(c(NA,NA,102,105,NA,300,400,700)),
  int3=as.integer(c(1,10,102,105,200,300,400,700)),
  cha1=c('a','b','c',NA,NA,'c','d','e'),
  cha3=c('xcda','b','c','miss','no','c','dfg','e'),
  fact1=c('a','b','c',NA,NA,'c','d','e'),
  fact3=c('ad','bd','cc','zz','yy','cc','dd','ed'),
  allm=as.integer(c(NA,NA,NA,NA,NA,NA,NA,NA)),
  miss=as.character(c("","",'c','miss','no','c','dfg','e')),
  miss2=as.integer(c('','',3,4,5,6,7,8)),
  miss3=as.factor(c(".",".",".","c","d","e","f","g")),
  miss4=as.factor(c(NA,NA,'.','.','','','t1','t2')),
  miss5=as.character(c(NA,NA,'.','.','','','t1','t2'))  
)

I was using this code to flag missing values:

dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]

But it turns out to be very slow, and additionally I have to add logic that also treats "." as missing. So I am planning to use this for missing value identification:

dt[miss5 %in% c(NA,'','.'),flag:=1]

but on a 6 million row data set it takes close to 1 second to run, whereas

dt[!nzchar(miss5),flag:=1]  takes close to 0.14 seconds to run.

My question is: can we write code that takes as little time as possible while treating NA, blank ("") and dot (".") as missing?
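For reference, a single combined condition is one option (a sketch only, assuming miss5 is a character column; %chin% is data.table's fast character-matching operator):

# Flag rows where miss5 is NA, empty, or "."
dt[, flag := as.integer(is.na(miss5) | miss5 %chin% c("", "."))]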

Any help is highly appreciated.

asked Jun 23 '15 by Anuj




2 Answers

== and %in% are optimised to use binary search automatically (NEW FEATURE: Auto indexing). To use it, we have to ensure that:

a) we use dt[...] instead of set(), as auto indexing is not yet implemented in set() (#1196).

b) When the RHS of %in% is of a higher SEXPTYPE than the LHS, auto indexing re-routes to base R to ensure correct results (since binary search always coerces the RHS). So for integer columns we need to make sure we pass in just NA, and not "." or "" (see the sketch below).
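To illustrate point (b) on the question's data (a sketch only; flag_int1 is a hypothetical column name, not part of the original code):

# With an integer column, keep the RHS integer (just NA) so auto indexing
# can build an index and use binary search:
dt[int1 %in% NA_integer_, flag_int1 := 1L]
# A character RHS such as c(NA, ".", "") is a higher SEXPTYPE, so auto indexing
# would re-route the query to base R's %in% instead:
# dt[int1 %in% c(NA, ".", ""), flag_int1 := 1L]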

Using @akrun's data, here's the code and run time:

in_col  = grep("^miss", names(dt), value=TRUE)   # columns to check
out_col = gsub("^miss", "flag", in_col)          # corresponding flag columns
system.time({
    dt[, (out_col) := 0L]                        # initialise all flag columns to 0
    for (j in seq_along(in_col)) {
        # "." and "" are only meaningful for character/factor columns
        if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
            lookup = c("", ".", NA)
        } else lookup = NA
        expr = call("%in%", as.name(in_col[j]), lookup)  # build: miss<j> %in% lookup
        tt = dt[eval(expr), (out_col[j]) := 1L]          # auto indexing + binary search
    }
})
#    user  system elapsed 
#   1.174   0.295   1.476 

How it works:

a) We first initialise all output columns to 0.

b) Then, for each column, we check its type and create the lookup accordingly.

c) We then create the corresponding expression for i, i.e. miss(.) %in% lookup (see the sketch below).

d) Then we evaluate the expression in i, which will use auto indexing to create an index very quickly, and then use that index to find the matching rows with binary search.
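For one character column, the constructed call is equivalent to writing the subset by hand (an illustration only, using the miss5 column and its derived flag column flag5 from the code above):

expr <- call("%in%", as.name("miss5"), c("", ".", NA))
expr                          # prints: miss5 %in% c("", ".", NA)
dt[eval(expr), flag5 := 1L]   # same as dt[miss5 %in% c("", ".", NA), flag5 := 1L]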

Note: If necessary, you can add a set2key(dt, NULL) at the end of the for loop so that the created indices are removed immediately after use (to save space).

Compared to this run, @akrun's fastest answer takes 6.33 seconds, so this is a ~4.2x speedup.

Update: On 4 million rows and 100 columns, it takes ~9.2 seconds, i.e. ~0.092 seconds per column.

Calling [.data.table 100 times could be expensive. When auto indexing is implemented in set(), it would be nice to compare the performance.

answered by Arun


You can loop through the 'miss' columns and create the corresponding 'flag' columns with set().

library(data.table) # v1.9.5+
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag', names(dt)[ind])
dt[, (nm1) := 0]
for(j in seq_along(ind)){
  set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
}
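On the example data from the question, the result can be checked along these lines (a sketch; flag5 is the column derived from miss5 by the sub('miss', 'flag', ...) pattern above):

# flag5 should be 1 exactly where miss5 is NA, "" or "."
dt[, .(miss5, flag5)]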

Benchmarks

set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA,0:9), 6e6*5, replace=TRUE), ncol=5))
set.seed(23)
df2 <- as.data.frame(matrix(sample(c('.','', letters[1:5]), 6e6*5,
   replace=TRUE), ncol=5))
set.seed(234)
i1 <- sample(10)
dfN <- setNames(cbind(df1, df2)[i1], paste0('miss',1:10))
dt <- as.data.table(dfN)

system.time({
 ind <- grep('^miss', names(dt))
 nm1 <- sub('miss', 'flag',names(dt)[ind])
 dt[,(nm1) := 0L]
 for(j in seq_along(ind)){
  set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)), j= nm1[j], value=1L)
  }
 }
)
#user  system elapsed 
#  8.352   0.150   8.496 

system.time({
  m1 <- matrix(0, nrow=6e6, ncol=10)
  m2 <- sapply(seq_along(dt), function(i) {
    ind <- which(dt[[i]] %in% c('.', '', NA))
    replace(m1[, i], ind, 1L)})
  cbind(dt, m2)})
#   user  system elapsed 
# 14.227   0.362  14.582   

answered by akrun