I have a data.table that contains some groups. I operate on each group, and some groups return numbers while others return NA. For some reason data.table has trouble putting everything back together. Is this a bug, or am I misunderstanding? Here is an example:
dtb <- data.table(a=1:10)
f <- function(x) {if (x==9) {return(NA)} else { return(x)}}
dtb[,f(a),by=a]
Error in `[.data.table`(dtb, , f(a), by = a) :
columns of j don't evaluate to consistent types for each group: result for group 9 has column 1 type 'logical' but expecting type 'integer'
My understanding was that NA is compatible with numbers in R, since clearly we can have a data.table that has NA values. I realize I can return NULL and that will work fine, but the issue is with NA.
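(To illustrate the NULL point: when j evaluates to NULL for a group, that group is simply dropped from the result, so a variant of the function that returns NULL runs without error. A minimal sketch; f2 is just an illustrative name:)
library(data.table)
dtb <- data.table(a = 1:10)
f2 <- function(x) {if (any(x == 9)) {return(NULL)} else {return(x)}}
dtb[, f2(a), by = a]  # runs, but the a == 9 group is omitted from the result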
From ?NA:
NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
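In other words, plain NA is logical, which is why the per-group results clash with the integer column:
class(NA)           # "logical" -- what f() returns for the group where a == 9
class(NA_integer_)  # "integer" -- matches the type of column a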
You will have to specify the correct type for your function to work. You can coerce within the function to match the type of x (note we need any() for this to work in situations with more than one row in a subset):
f <- function(x) {if (any(x == 9)) {return(as(NA, class(x)))} else {return(x)}}
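With that coercion in place, the original grouped call runs cleanly; as(NA, "integer") yields NA_integer_, so the result column stays integer:
dtb <- data.table(a = 1:10)  # fresh copy; the earlier failed call left dtb unchanged
dtb[, f(a), by = a]          # no error; V1 is integer with NA for the a == 9 group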
It might make more data.table sense to use set (or :=) to set / replace by reference.
set(dtb, i = which(dtb[,a]==9), j = 'a', value=NA_integer_)
Or := within [, using a vector scan for a == 9:
dtb[a == 9, a := NA_integer_]
Or := along with a binary search:
setkeyv(dtb, 'a')
dtb[J(9), a := NA_integer_]
If you use the := or set approaches, you don't appear to need to specify the NA type. Both of the following will work:
dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
dtb[a==9,a := NA]
dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
set(dtb, which(dtb[,a] == 9), 'a', NA)
However, the binary search form does still require a typed NA; with a plain NA it throws the type error (DTc here is one of the copies from the benchmark below):
Error in `[.data.table`(DTc, J(9), `:=`(a, NA)) :
Type of RHS ('logical') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Benchmarking with a reasonably large data set where a is replaced in situ:
library(data.table)
set.seed(1)
n <- 1e+07
DT <- data.table(a = sample(15, n, T))
setkeyv(DT, "a")
DTa <- copy(DT)
DTb <- copy(DT)
DTc <- copy(DT)
DTd <- copy(DT)
DTe <- copy(DT)
f <- function(x) {
if (any(x == 9)) {
return(as(NA, class(x)))
} else {
return(x)
}
}
system.time({DT[a == 9, `:=`(a, NA_integer_)]})
## user system elapsed
## 0.95 0.24 1.20
system.time({DTa[a == 9, `:=`(a, NA)]})
## user system elapsed
## 0.74 0.17 1.00
system.time({DTb[J(9), `:=`(a, NA_integer_)]})
## user system elapsed
## 0.02 0.00 0.02
system.time({set(DTc, which(DTc[, a] == 9), j = "a", value = NA)})
## user system elapsed
## 0.49 0.22 0.67
system.time({set(DTd, which(DTd[, a] == 9), j = "a", value = NA_integer_)})
## user system elapsed
## 0.54 0.06 0.58
system.time({DTe[, `:=`(a, f(a)), by = a]})
## user system elapsed
## 0.53 0.12 0.66
# They are all the same!
all(identical(DT, DTa), identical(DT, DTb), identical(DT, DTc), identical(DT,
DTd), identical(DT, DTe))
## [1] TRUE
Unsurprisingly, the binary search approach is the fastest.