
data.table: subsetting a grouping variable in j with keyby

Say I have this dataset

test <- data.table(X = rep(1, 3), Y = rep("a", 3))

which gives us

test
#   X Y
#1: 1 a
#2: 1 a
#3: 1 a

I'm wondering why

test[, X[Y == "a"], keyby = .(X)]

gives

#   X V1
#1: 1  1
#2: 1 NA
#3: 1 NA

Thank you in advance for your answers!

minutestatistique asked Apr 07 '21 21:04


3 Answers

If you run X and Y=="a" separately

> test[, X, keyby = .(X)]
   X X
1: 1 1

> test[, Y == "a", keyby = .(X)]
   X   V1
1: 1 TRUE
2: 1 TRUE
3: 1 TRUE

you will see that the first one gives the numeric value 1 as a vector of length 1, while the second one gives a logical vector of three TRUE values.

Since the lengths don't match, subsetting the length-1 vector with the length-3 logical index fills every position past the first with NA, e.g.,

> 1[rep(TRUE,3)]
[1]  1 NA NA

ThomasIsCoding answered Oct 17 '22 06:10


uniqueN returns 2 because there are two distinct values: the grouping value 1 of 'X' and the NA used as fill. We could pass na.rm = TRUE to uniqueN:

test[, uniqueN(X[Y == "a"],  na.rm = TRUE), keyby = .(X)]
#   X V1
#1: 1  1

As mentioned in @ThomasIsCoding's post, the mismatch between the length of the logical vector and the length of the grouping variable (which is 1 inside each group) causes the extra TRUE positions to be filled with NA. One option is to replicate X to the group size:

test[, rep(X, .N)[Y == "a"], keyby = .(X)]
#   X V1
#1: 1  1
#2: 1  1
#3: 1  1
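
If the underlying goal is a per-group count of the rows where Y == "a" (an assumption, since the question does not say so), a minimal sketch that avoids subsetting the grouping variable in j altogether could look like this. The names counts and counts_j are only illustrative.

```r
library(data.table)

test <- data.table(X = rep(1, 3), Y = rep("a", 3))

# Filter rows in i first, then count the remaining rows per group.
# The length-1 grouping value X is never subset in j.
counts <- test[Y == "a", .N, keyby = .(X)]
print(counts)

# Same count for this data, computed in j from the logical vector itself.
counts_j <- test[, .(N = sum(Y == "a")), keyby = .(X)]
print(counts_j)
```

The two forms agree here, but they can differ on other data: filtering in i drops groups with no matching rows, while the sum in j keeps them with a count of 0.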

akrun answered Oct 17 '22 04:10


Well, it's complicated, in a way.

It has to do with what X is inside a grouping.

Consider these variations:

description     expression
Yours           test[, X[Y == "a"], keyby=.(X) ]
X only          test[, X, keyby=.(X) ]
Y=="a" only     test[, Y == "a", keyby=.(X) ]

X only gives:


> test[, X, keyby=.(X) ]
   X X
1: 1 1

This is what 'X' is inside your grouping. Only that one value.

The third expression:


> test[, Y == "a", keyby=.(X) ]
   X   V1
1: 1 TRUE
2: 1 TRUE
3: 1 TRUE

There you see what Y == "a" looks like inside your grouping.
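
One way to check the two observations above side by side is to ask for the lengths directly inside j. This is a small sketch on the same test table; the column names len_X, len_Y and n_rows are made up for illustration.

```r
library(data.table)

test <- data.table(X = rep(1, 3), Y = rep("a", 3))

# Inside j, the grouping column X holds a single value per group,
# while the non-grouping column Y has one element per row of the group.
lens <- test[, .(len_X = length(X), len_Y = length(Y), n_rows = .N), keyby = .(X)]
print(lens)
```

For the single group here, len_X should come out as 1 while len_Y and n_rows are both 3, which is the length mismatch behind the NA values.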

If you combine these and evaluate X[ Y == "a" ] inside your grouping, you effectively do:


X <- 1
X[ c(TRUE,TRUE,TRUE) ]

X has only one value but is asked to return its first, second and third values, so it gives you its one value and two NAs, which is what you see.

Sirius answered Oct 17 '22 06:10