Selecting rows or columns with data.table R?

Question

Imagine I have a data.table, for example:

library(data.table) 
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))

    V1      V2    V3  V4
 1:  1 0.33577273  0  0
 2:  2 0.66739739  2  1
 3:  3 0.07501655  0  0
 4:  4 0.43195663  2  1
 5:  5 0.39525841  3  2
 6:  6 0.15189738  1  1
 7:  7 0.02637279  0  1
 8:  8 0.44165623  0  1
 9:  9 0.98710570  2  0
10: 10 0.62402805  1  0
11: 11 0.84829465  3  2
12: 12 0.02170976  0  1
13: 13 0.74608925  0  2
14: 14 0.29102296  2  0
15: 15 0.83820646  1  1

How can I get a data.table from it, with all the ROWS that contain a "0" at any column? (or some value)
If I had to do it with a single column I could use:

RRR[V4==0,]

   V1    V2      V3  V4
1:  1 0.33577273  0  0
2:  3 0.07501655  0  0
3:  9 0.98710570  2  0
4: 10 0.62402805  1  0
5: 14 0.29102296  2  0

But what if I want to do it with all the columns at once because I have many?

This doesn't do what I need.

RRR[,sapply(RRR,function(xx)(xx==0)), with=TRUE]   

     V1      V2     V3    V4
[1,]  FALSE FALSE  TRUE  TRUE
[2,]  FALSE FALSE FALSE FALSE
[3,]  FALSE FALSE  TRUE  TRUE
[4,]  FALSE FALSE FALSE FALSE
[5,]  FALSE FALSE FALSE FALSE
[6,]  FALSE FALSE FALSE FALSE
[7,]  FALSE FALSE  TRUE FALSE
[8,]  FALSE FALSE  TRUE FALSE
[9,]  FALSE FALSE FALSE  TRUE
[10,] FALSE FALSE FALSE  TRUE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE  TRUE FALSE
[13,] FALSE FALSE  TRUE FALSE
[14,] FALSE FALSE FALSE  TRUE
[15,] FALSE FALSE FALSE FALSE

Maybe with a for loop and some complicated paste?. Though, I would prefer to use simple data.table syntax.

Similarly, how would you get a data.table with all the COLUMNS that contain a '0' at any row?

I know how to get the columns (as a whole) that fulfills a condition, such as being numeric,

RRR[,sapply(RRR,function(xx)is.numeric(xx)),with=FALSE]

but this method doesn't work if I want to test the condition elementwise.

In case anybody is interested, this is the system.time() for a bigger random data.table with the different solutions you provided so far, with slight modifications.

set.seed(1)
n <- 1000000
RRR <- data.table(matrix(rgeom(100*n,0.5), ncol=100))

Getting ROWS   
> RRR[RRR[,rowSums(RRR==0)>0]] 
   user  system elapsed 
   2.72    0.55    3.27 
> RRR[rowSums(RRR==0)>0] 
   user  system elapsed 
   2.58    0.70    3.28 
> RRR[apply(RRR,MAR=1,function(xx)any(xx==0))]
   user  system elapsed 
   10.81    0.19   11.00       
> RRR[apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=1)]
  user  system elapsed 
  10.49    0.30   10.83 

Getting COLUMNS
> RRR[,sapply(RRR,function(xx)any(xx==0)), with=FALSE] 
   user  system elapsed 
   0.81    0.31    1.12 
> `[.listof`(RRR,colSums(RRR==0)>0) 
   user  system elapsed 
   2.14    0.27    2.41 
> RRR[,colSums(RRR==0)>0, with=FALSE] 
   user  system elapsed 
   2.26    0.48    2.75 
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))]      #only version 1.9.5, seems the same solution than the first one.
   user  system elapsed 
   0.78    0.36    1.14 
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]
   user  system elapsed 
   0.41    0.25    0.66 
> RRR[Reduce('|',lapply(RRR,function(xx)(xx==0)))]
   user  system elapsed 
   3.11    0.33    3.44 
> RRR[,apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=2),with=FALSE]
   user  system elapsed 
   3.48    0.80    4.28

I haven't included yet:

RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))][i==TRUE][,i:=NULL]

It took several minutes and I stopped it, and it "tags" the rows instead of extracting them and it's the most complex solution.

I'll wait for faster or simpler solutions and hear your comments and likings.

sapply was supposed to be slower but it isn't. The results could change if the data.table contains other kind of data.

We could speed it up if we can stop the test (==0) as soon as the first occurrence happens within every row or column. But I guess we can't do it without loops or some low level access or bitwise operation.

I've thought of a new method.

sapply(RRR,function(xx)which(xx==0))
I need to combine the results of a) with a union of the lists, but I don't know how to do it for any number of columns.
And then get that rows RRR["a)"]

I guess it's gonna be much slower if the number of zeroes is big.

Maybe also try RRR[unique(unlist(sapply(RRR,function(xx)which(xx==0))))] but it's too slow.

An option to get the opposite would be RRR[(RRR==0)] <- NA; na.omit(RRR)

Frank · Accepted Answer

The rowSums function can be used here:

RRR[rowSums(!RRR)>0]

How it works: !RRR is a matrix with TRUE at any zero. In the general case, you can replace !RRR with whatever logical condition you want to check. For example, to see if any element is equal to 3, you could take the rowSums of RRR==3.

I think rowSums(test(x))>0 is essentially the same as apply(RRR,1,function(x)any(!test(x))); both coerce the object to a matrix. I find the rowSums version easier to read and think I've heard people praise its efficiency.

For columns, similarly:

RRR[, colSums(!RRR)>0, with=FALSE]

jangorecki · Answer

Maybe this.

library(data.table) 
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))
RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))
    ][i==TRUE
      ][,i:=NULL]

Extending answer for the second part of question.

 RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))]
 # you may add this one also to timing, I wonder how it will work
 RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]

.SDcols as logical vector was introduced quite recently so be sure to update your data.table first.

Selecting rows or columns with data.table R?

Tags:

select

r

data.table

row

skan

2 Answers

Frank

jangorecki

Recent Activity

Donate For Us

Selecting rows or columns with data.table R?

Tags:

select

r

data.table

row

skan

2 Answers

Frank

jangorecki

Related questions

Recent Activity

Donate For Us