I have a column in a data.table with NaN values. Something like:
my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3), y = c(2, 4, 6, 8, 10, 12, 14))
setkey(my.dt, x)
I can use the J() function to find all instances where the x column is equal to .2:
> my.dt[J(.2)]
     x  y
1: 0.2 10
2: 0.2 12
But if I try to do the same thing with NaN, it doesn't work:

> my.dt[J(NaN)]
     x  y
1: NaN NA
I would expect:

     x y
1: NaN 2
2: NaN 4
3: NaN 6
What gives? I can't find anything in the data.table documentation to explain why this is happening (although it may just be that I don't know what to look for). Is there any way to get what I want? Ultimately, I'd like to replace all of the NaN values with zero, using something like my.dt[J(NaN), x := 0].
Update: This has been fixed a while back, in v1.9.2. From NEWS:
NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.
require(data.table) ## 1.9.2+
my.dt[J(NaN)]
#      x y
# 1: NaN 2
# 2: NaN 4
# 3: NaN 6
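With 1.9.2+, the assignment from the question should also work as written (a minimal sketch on the question's my.dt; note that assigning to a key column with := removes the key):

my.dt[J(NaN), x := 0]  # join on the NaN rows and replace with zero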
This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NAs in a data.table key. The main idea, outlined in the FAQ, is that NAs are treated as FALSE:
http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html
subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently
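For a quick illustration of that behaviour (a toy table of my own, not from the question):

library(data.table)
dt <- data.table(a = c(NA, 1, 2), b = 1:3)
dt[a != 1]              # the NA row is silently dropped: NA acts like FALSE
#    a b
# 1: 2 3
dt[is.na(a) | a != 1]   # test for NA explicitly if you want to keep those rows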
In the meantime, your best bet is to use is.na. While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds.
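For the replacement the question ultimately wants, the is.na route is simply (a sketch on the question's my.dt):

my.dt[is.na(x), x := 0]   # is.na() is TRUE for both NA and NaN
# or, to target NaN only: my.dt[is.nan(x), x := 0]

A benchmark of the lookup options: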
library(microbenchmark)
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID == 1], my.dt[!(!(!(ID)))])
# Unit: milliseconds
#                expr    median
#         my.dt[.(1)]  1.309948
#    my.dt[is.na(ID)]  3.444689  <~~ Not bad
#      my.dt[ID == 1]  4.005093
#  my.dt[!(!(!(ID)))] 10.038134
### using the following for my.dt
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
Here's a fast workaround that relies a lot on what's actually happening internally (making the code a bit fragile, imo). Because internally NaN is just a very, very negative number, it will always be at the front of your data.table when you setkey. We can use that property to isolate those entries like so:
# this will give the index of the first element that is *not* NaN
my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T]
# this is equivalent to my.dt[!is.nan(x)], but much faster
my.dt[seq_len(my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1)]
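And if the end goal is the replacement from the question, the same index can drive a := (a sketch under the same assumptions as above; again, assigning to the key column x drops the key):

# zero out the NaN block at the front of the keyed table
n_nan <- my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1
my.dt[seq_len(n_nan), x := 0]  # a no-op when there are no NaN rows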
Here's a benchmark for Ricardo's sample data:
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
# NOTE: I have to use integer max here, because this example has integers
# instead of doubles, so I'll just add a simple helper function (it would
# likely need to be extended for other cases, but I'm just dealing with the ones here)
minN = function(x) if (is.integer(x)) -.Machine$integer.max else -.Machine$double.xmax
library(microbenchmark)
microbenchmark(normalJ = my.dt[J(1)],
               naJ = my.dt[seq_len(my.dt[J(minN(ID)), roll = -Inf, which = T] - 1)])
#Unit: milliseconds
# expr min lq median uq max neval
# normalJ 1.645442 1.864812 2.120577 2.863497 5.431828 100
# naJ 1.465806 1.689350 2.030425 2.600720 10.436934 100
In my tests the following minN function also covers character and logical vectors:
minN = function(x) {
  if (is.integer(x)) {
    -.Machine$integer.max
  } else if (is.numeric(x)) {
    -.Machine$double.xmax
  } else if (is.character(x)) {
    ""
  } else if (is.logical(x)) {
    FALSE
  } else {
    NA
  }
}
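For reference, a quick check of what each branch returns (the printed values are from a typical machine):

minN(1L)    # -2147483647     (-.Machine$integer.max)
minN(1.5)   # -1.797693e+308  (-.Machine$double.xmax)
minN("a")   # ""
minN(TRUE)  # FALSE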
And you will want to add mult = 'first', e.g.:
my.dt[seq_len(my.dt[J(minN(colname)), roll = -Inf, which = T, mult = 'first'] - 1)]