I noticed some odd behavior in data.table; hopefully someone who understands data.table better than I do can explain it.
Say I have this data.table:
library(data.table)
DT <- data.table(
  C1 = c(rep("A", 4), rep("B", 4), rep("C", 4)),
  C2 = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("d", 3)),
  Val = c(1:5, NaN, NaN, 8, 9, 10, NaN, 12))
DT
C1 C2 Val
1: A a 1
2: A a 2
3: A a 3
4: A b 4
5: B b 5
6: B b NaN
7: B c NaN
8: B c 8
9: C c 9
10: C d 10
11: C d NaN
12: C d 12
Now, in my mind, the following two methods should generate the same results, but they do not.
TEST1 <- DT[, agg := min(Val, na.rm = TRUE), by = c('C1', 'C2')]
TEST1 <- data.table(unique(TEST1[, c('C1','C2','agg'), with = FALSE]))
TEST2 <- DT[, list(agg = min(Val, na.rm = TRUE)), by = c('C1', 'C2')]
TEST1
C1 C2 agg
1: A a 1
2: A b 4
3: B b 5
4: B c 8
5: C c 9
6: C d 10
TEST2
C1 C2 agg
1: A a 1
2: A b 4
3: B b 5
4: B c NaN
5: C c 9
6: C d 10
As you can see, using " := " gives a minimum of 8 for (C1 = B, C2 = c), whereas the list() form gives NaN. Oddly, for (C1 = B, C2 = b) and (C1 = C, C2 = d), which also contain NaN, the list() form does produce a value. I suspect the difference is ordering: NaN is returned when the NaN comes before any real value within a given C1/C2 combination, whereas in the other two groups the NaN comes after a value.
Why does this occur?
I note that if the NaN values are replaced with NA, the expected values are generated with no problems.
I've just fixed this issue (#1461) in the development version, v1.9.7, with commit 2080.
require(data.table) # v1.9.7, commit 2080+
DT <- data.table(
  C1 = c(rep("A", 4), rep("B", 4), rep("C", 4)),
  C2 = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("d", 3)),
  Val = c(1:5, NaN, NaN, 8, 9, 10, NaN, 12))
DT[, list(agg = min(Val, na.rm = TRUE)), by = c('C1', 'C2')]
# C1 C2 agg
# 1: A a 1
# 2: A b 4
# 3: B b 5
# 4: B c 8
# 5: C c 9
# 6: C d 10
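For anyone who cannot upgrade yet, one possible workaround is to drop the missing values in i before grouping, so that na.rm = TRUE is never needed. This is only a sketch and is not part of the fix itself; the name res is hypothetical, and it assumes you are happy for a group made up entirely of NaN/NA to disappear from the result rather than show up as Inf.

```r
library(data.table)

DT <- data.table(
  C1 = c(rep("A", 4), rep("B", 4), rep("C", 4)),
  C2 = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("d", 3)),
  Val = c(1:5, NaN, NaN, 8, 9, 10, NaN, 12))

# is.na() is TRUE for both NA and NaN, so this filter removes every
# missing value before min() runs and no na.rm argument is required.
# 'res' is a hypothetical name used only for this illustration.
res <- DT[!is.na(Val), list(agg = min(Val)), by = c('C1', 'C2')]
print(res)
```

On the example data this should give 8 for (C1 = B, C2 = c), matching the " := " result from the question.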