
Select NA in a data.table in R


How do I select all the rows that have a missing value in the primary key of a data.table?

DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Selecting for a particular value is easy:

DT["a",]   

Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?

DT[NA,]          # does not work
DT[is.na(x),]    # does work
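As a minimal sketch of the vector-scan approach on the same small table, the following checks that `is.na(x)` picks out exactly the rows whose key is missing. The variable name `na_rows` is introduced here only for illustration.

```r
library(data.table)

# Same toy table as in the question, keyed on x
DT <- data.table(x = rep(c("a", "b", NA), each = 3), y = c(1, 3, 6), v = 1:9)
setkey(DT, x)

# Vector scan: evaluate is.na() over the whole key column
na_rows <- DT[is.na(x), ]

# Row positions of the matches instead of the rows themselves
na_pos <- DT[is.na(x), which = TRUE]

print(na_rows)
print(na_pos)
```

Using `which = TRUE` returns the matching row numbers rather than copying the rows, which is handy when only the positions are needed.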
Farrel asked Sep 28 '12 19:09


1 Answer

Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)

benchmark(DT["a",],
          DT[is.na(x),],
          replications=20)
#             test replications elapsed relative user.self sys.self user.child
# 1      DT["a", ]           20    9.18    1.000      7.31     1.83         NA
# 2 DT[is.na(x), ]           20   10.55    1.149      8.69     1.85         NA

===

Addition from Matthew (won't fit in a comment):

The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).

benchmark(DT["a",],  # repeat select of large subset on my netbook
    DT[is.na(x),],
    replications=3)
#           test replications elapsed relative user.self sys.self
#      DT["a", ]            3   2.406    1.000     2.357    0.044
# DT[is.na(x), ]            3   3.876    1.611     3.812    0.056

benchmark(DT["a",which=TRUE],   # isolate search time
    DT[is.na(x),which=TRUE],
    replications=3)
#                       test replications elapsed relative user.self sys.self
#      DT["a", which = TRUE]            3   0.492    1.000     0.492    0.000
# DT[is.na(x), which = TRUE]            3   2.941    5.978     2.932    0.004

As the size of the subset returned decreases (e.g. adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns it quickly degrades.
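To see this effect with many small groups, here is a rough sketch. The table size, the number of groups, the group labels, and the repeat count are all assumptions chosen for illustration, and timings will differ by machine and data.table version, so no figures are quoted.

```r
library(data.table)

set.seed(1)
n_groups <- 1e4                                # assumed: many small groups
ids <- sprintf("g%05d", seq_len(n_groups))     # hypothetical group labels

# 1e6 rows spread over the groups, with some NA keys mixed in
DT <- data.table(x = sample(c(ids, NA), 1e6, replace = TRUE), v = 1:1e6)
setkey(DT, x)

target <- ids[1]

# Keyed lookup (binary search); which = TRUE isolates the search time
t_key <- system.time(for (i in 1:50) DT[target, which = TRUE])

# Vector scan over the whole key column for the missing values
t_na <- system.time(for (i in 1:50) DT[is.na(x), which = TRUE])

print(rbind(keyed = t_key, is_na_scan = t_na)[, "elapsed", drop = FALSE])
```

Each result here is a small fraction of the table, so the cost of the search itself should make up more of the elapsed time than in the three-group benchmark above.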

Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043, "Allow or disallow NA in keys?". It mentions that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc), resulting in setkey going slower. But it's on the list to revisit.

Josh O'Brien answered Oct 27 '22 20:10