Imagine I have a data.table, for example:
library(data.table)
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))
V1 V2 V3 V4
1: 1 0.33577273 0 0
2: 2 0.66739739 2 1
3: 3 0.07501655 0 0
4: 4 0.43195663 2 1
5: 5 0.39525841 3 2
6: 6 0.15189738 1 1
7: 7 0.02637279 0 1
8: 8 0.44165623 0 1
9: 9 0.98710570 2 0
10: 10 0.62402805 1 0
11: 11 0.84829465 3 2
12: 12 0.02170976 0 1
13: 13 0.74608925 0 2
14: 14 0.29102296 2 0
15: 15 0.83820646 1 1
How can I get a data.table from it, with all the ROWS that contain a "0" at any column? (or some value)
If I had to do it with a single column I could use:
RRR[V4==0,]
V1 V2 V3 V4
1: 1 0.33577273 0 0
2: 3 0.07501655 0 0
3: 9 0.98710570 2 0
4: 10 0.62402805 1 0
5: 14 0.29102296 2 0
But what if I want to do it with all the columns at once because I have many?
This doesn't do what I need.
RRR[,sapply(RRR,function(xx)(xx==0)), with=TRUE]
V1 V2 V3 V4
[1,] FALSE FALSE TRUE TRUE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE TRUE FALSE
[8,] FALSE FALSE TRUE FALSE
[9,] FALSE FALSE FALSE TRUE
[10,] FALSE FALSE FALSE TRUE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE TRUE FALSE
[13,] FALSE FALSE TRUE FALSE
[14,] FALSE FALSE FALSE TRUE
[15,] FALSE FALSE FALSE FALSE
Maybe with a for loop and some complicated paste?. Though, I would prefer to use simple data.table syntax.
Similarly, how would you get a data.table with all the COLUMNS that contain a '0' at any row?
I know how to get the columns (as a whole) that fulfills a condition, such as being numeric,
RRR[,sapply(RRR,function(xx)is.numeric(xx)),with=FALSE]
but this method doesn't work if I want to test the condition elementwise.
In case anybody is interested, this is the system.time() for a bigger random data.table with the different solutions you provided so far, with slight modifications.
set.seed(1)
n <- 1000000
RRR <- data.table(matrix(rgeom(100*n,0.5), ncol=100))
Getting ROWS
> RRR[RRR[,rowSums(RRR==0)>0]]
user system elapsed
2.72 0.55 3.27
> RRR[rowSums(RRR==0)>0]
user system elapsed
2.58 0.70 3.28
> RRR[apply(RRR,MAR=1,function(xx)any(xx==0))]
user system elapsed
10.81 0.19 11.00
> RRR[apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=1)]
user system elapsed
10.49 0.30 10.83
Getting COLUMNS
> RRR[,sapply(RRR,function(xx)any(xx==0)), with=FALSE]
user system elapsed
0.81 0.31 1.12
> `[.listof`(RRR,colSums(RRR==0)>0)
user system elapsed
2.14 0.27 2.41
> RRR[,colSums(RRR==0)>0, with=FALSE]
user system elapsed
2.26 0.48 2.75
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))] #only version 1.9.5, seems the same solution than the first one.
user system elapsed
0.78 0.36 1.14
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]
user system elapsed
0.41 0.25 0.66
> RRR[Reduce('|',lapply(RRR,function(xx)(xx==0)))]
user system elapsed
3.11 0.33 3.44
> RRR[,apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=2),with=FALSE]
user system elapsed
3.48 0.80 4.28
I haven't included yet:
RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))][i==TRUE][,i:=NULL]
It took several minutes and I stopped it, and it "tags" the rows instead of extracting them and it's the most complex solution.
I'll wait for faster or simpler solutions and hear your comments and likings.
sapply was supposed to be slower but it isn't. The results could change if the data.table contains other kind of data.
We could speed it up if we can stop the test (==0) as soon as the first occurrence happens within every row or column. But I guess we can't do it without loops or some low level access or bitwise operation.
I've thought of a new method.
I guess it's gonna be much slower if the number of zeroes is big.
Maybe also try RRR[unique(unlist(sapply(RRR,function(xx)which(xx==0))))]
but it's too slow.
An option to get the opposite would be RRR[(RRR==0)] <- NA; na.omit(RRR)
The rowSums
function can be used here:
RRR[rowSums(!RRR)>0]
How it works: !RRR
is a matrix with TRUE
at any zero. In the general case, you can replace !RRR
with whatever logical condition you want to check. For example, to see if any element is equal to 3
, you could take the rowSums
of RRR==3
.
I think rowSums(test(x))>0
is essentially the same as apply(RRR,1,function(x)any(!test(x)))
; both coerce the object to a matrix. I find the rowSums
version easier to read and think I've heard people praise its efficiency.
For columns, similarly:
RRR[, colSums(!RRR)>0, with=FALSE]
Maybe this.
library(data.table)
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))
RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))
][i==TRUE
][,i:=NULL]
Extending answer for the second part of question.
RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))]
# you may add this one also to timing, I wonder how it will work
RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]
.SDcols
as logical vector was introduced quite recently so be sure to update your data.table first.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With