Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to drop rows with missing values?

Tags:

r

data.table

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that set being specified by a character vector varcols.

So far I've tried the following:

require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")

x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]

Is there a faster way of doing this? Thanks.

like image 995
Michael Avatar asked Dec 07 '12 01:12

Michael


3 Answers

This should be faster than using apply:

x[rowSums(is.na(x[, ..varcols])) == 0, ]
#    var1 var2 textcol
# 1:    0    0       e
# 2:    0    1       f
# 3:    1    0       h
# 4:    1    1       i
like image 115
flodel Avatar answered Sep 21 '22 09:09

flodel


Here is a revised version of a c++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to c so I am sure that someone might still be able to improve this.

After library("RcppArmadillo") you should be able to run the whole file including the benchmark using sourceCpp('cleanmat.cpp'). The c++-file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the matrix without the columns with missing values. keep just takes one argument X and returns a logical vector.

Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here.

cleanmat.cpp

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;


// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
    // remove colums
    X = X.cols(idx - 1);
    // get dimensions
    int n = X.n_rows,k = X.n_cols;
    // create keep vector
    vec keep = ones<vec>(n);
    for (int j = 0; j < k; j++) 
        for (int i = 0; i < n; i++) 
            if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
    // alternative with view for each row (slightly slower)
    /*vec keep = zeros<vec>(n);
    for (int i = 0; i < n; i++) {
         keep(i) = is_finite(X.row(i));
    }*/  
    return (X.rows(find(keep==1)));
}


// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
    int n = X.nrow(), k = X.ncol();
    // create keep vector
    LogicalVector keep(n, true);
    for (int j = 0; j < k; j++) 
        for (int i = 0; i < n; i++) 
            if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;

    return (keep);
}


/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")

# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")

idx=sample(ncol(X),90)
microbenchmark(
  X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
  X[rowSums(is.na(X[,idx])) == 0, idx],
  cleanmat(X,idx),
  X[keep(X[,idx]),idx],
times=3)

# output
# Unit: milliseconds
#                                                     expr       min        lq    median        uq       max
# 1                                       cleanmat(X, idx)  253.2596  259.7738  266.2880  272.0900  277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3                                 X[keep(X[, idx]), idx]  360.8254  361.5165  362.2077  371.2061  380.2045
# 4                  X[rowSums(is.na(X[, idx])) == 0, idx]  358.4772  367.5698  376.6625  379.6093  382.5561

*/
like image 43
user2503795 Avatar answered Sep 21 '22 09:09

user2503795


For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :

keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]

The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.

1. is.na(x[, ..varcols])
This is ok but creates a large copy to hold the logical matrix as large as length(varcols). And the ==0 on the result of rowSums will need a new vector, too.

2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.

3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.

So, the very fastest (I guess), would be a C version like this (pseudo-code) :

keep = rep(TRUE,nrow(x))
for (j=0; j<varcols; j++)
   for (i=0; i<nrow(x); i++)
       if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]

That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.

like image 33
Matt Dowle Avatar answered Sep 22 '22 09:09

Matt Dowle