Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting rows or columns with data.table R?

Imagine I have a data.table, for example:

library(data.table) 
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))

    V1      V2    V3  V4
 1:  1 0.33577273  0  0
 2:  2 0.66739739  2  1
 3:  3 0.07501655  0  0
 4:  4 0.43195663  2  1
 5:  5 0.39525841  3  2
 6:  6 0.15189738  1  1
 7:  7 0.02637279  0  1
 8:  8 0.44165623  0  1
 9:  9 0.98710570  2  0
10: 10 0.62402805  1  0
11: 11 0.84829465  3  2
12: 12 0.02170976  0  1
13: 13 0.74608925  0  2
14: 14 0.29102296  2  0
15: 15 0.83820646  1  1

How can I get a data.table from it, with all the ROWS that contain a "0" at any column? (or some value)
If I had to do it with a single column I could use:

RRR[V4==0,]

   V1    V2      V3  V4
1:  1 0.33577273  0  0
2:  3 0.07501655  0  0
3:  9 0.98710570  2  0
4: 10 0.62402805  1  0
5: 14 0.29102296  2  0

But what if I want to do it with all the columns at once because I have many?

This doesn't do what I need.

RRR[,sapply(RRR,function(xx)(xx==0)), with=TRUE]   

     V1      V2     V3    V4
[1,]  FALSE FALSE  TRUE  TRUE
[2,]  FALSE FALSE FALSE FALSE
[3,]  FALSE FALSE  TRUE  TRUE
[4,]  FALSE FALSE FALSE FALSE
[5,]  FALSE FALSE FALSE FALSE
[6,]  FALSE FALSE FALSE FALSE
[7,]  FALSE FALSE  TRUE FALSE
[8,]  FALSE FALSE  TRUE FALSE
[9,]  FALSE FALSE FALSE  TRUE
[10,] FALSE FALSE FALSE  TRUE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE  TRUE FALSE
[13,] FALSE FALSE  TRUE FALSE
[14,] FALSE FALSE FALSE  TRUE
[15,] FALSE FALSE FALSE FALSE

Maybe with a for loop and some complicated paste?. Though, I would prefer to use simple data.table syntax.

Similarly, how would you get a data.table with all the COLUMNS that contain a '0' at any row?

I know how to get the columns (as a whole) that fulfills a condition, such as being numeric,

RRR[,sapply(RRR,function(xx)is.numeric(xx)),with=FALSE]

but this method doesn't work if I want to test the condition elementwise.


In case anybody is interested, this is the system.time() for a bigger random data.table with the different solutions you provided so far, with slight modifications.

set.seed(1)
n <- 1000000
RRR <- data.table(matrix(rgeom(100*n,0.5), ncol=100))

Getting ROWS   
> RRR[RRR[,rowSums(RRR==0)>0]] 
   user  system elapsed 
   2.72    0.55    3.27 
> RRR[rowSums(RRR==0)>0] 
   user  system elapsed 
   2.58    0.70    3.28 
> RRR[apply(RRR,MAR=1,function(xx)any(xx==0))]
   user  system elapsed 
   10.81    0.19   11.00       
> RRR[apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=1)]
  user  system elapsed 
  10.49    0.30   10.83 

Getting COLUMNS
> RRR[,sapply(RRR,function(xx)any(xx==0)), with=FALSE] 
   user  system elapsed 
   0.81    0.31    1.12 
> `[.listof`(RRR,colSums(RRR==0)>0) 
   user  system elapsed 
   2.14    0.27    2.41 
> RRR[,colSums(RRR==0)>0, with=FALSE] 
   user  system elapsed 
   2.26    0.48    2.75 
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))]      #only version 1.9.5, seems the same solution than the first one.
   user  system elapsed 
   0.78    0.36    1.14 
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]
   user  system elapsed 
   0.41    0.25    0.66 
> RRR[Reduce('|',lapply(RRR,function(xx)(xx==0)))]
   user  system elapsed 
   3.11    0.33    3.44 
> RRR[,apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=2),with=FALSE]
   user  system elapsed 
   3.48    0.80    4.28  

I haven't included yet:

RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))][i==TRUE][,i:=NULL]   

It took several minutes and I stopped it, and it "tags" the rows instead of extracting them and it's the most complex solution.

I'll wait for faster or simpler solutions and hear your comments and likings.

sapply was supposed to be slower but it isn't. The results could change if the data.table contains other kind of data.


We could speed it up if we can stop the test (==0) as soon as the first occurrence happens within every row or column. But I guess we can't do it without loops or some low level access or bitwise operation.

I've thought of a new method.

  1. sapply(RRR,function(xx)which(xx==0))
  2. I need to combine the results of a) with a union of the lists, but I don't know how to do it for any number of columns.
  3. And then get that rows RRR["a)"]

I guess it's gonna be much slower if the number of zeroes is big.

Maybe also try RRR[unique(unlist(sapply(RRR,function(xx)which(xx==0))))] but it's too slow.

An option to get the opposite would be RRR[(RRR==0)] <- NA; na.omit(RRR)

like image 813
skan Avatar asked Apr 23 '15 00:04

skan


2 Answers

The rowSums function can be used here:

RRR[rowSums(!RRR)>0]

How it works: !RRR is a matrix with TRUE at any zero. In the general case, you can replace !RRR with whatever logical condition you want to check. For example, to see if any element is equal to 3, you could take the rowSums of RRR==3.

I think rowSums(test(x))>0 is essentially the same as apply(RRR,1,function(x)any(!test(x))); both coerce the object to a matrix. I find the rowSums version easier to read and think I've heard people praise its efficiency.


For columns, similarly:

RRR[, colSums(!RRR)>0, with=FALSE]
like image 164
Frank Avatar answered Oct 10 '22 12:10

Frank


Maybe this.

library(data.table) 
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))
RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))
    ][i==TRUE
      ][,i:=NULL]

Extending answer for the second part of question.

 RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))]
 # you may add this one also to timing, I wonder how it will work
 RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]

.SDcols as logical vector was introduced quite recently so be sure to update your data.table first.

like image 31
jangorecki Avatar answered Oct 10 '22 14:10

jangorecki