data.table results differ between vector scan and binary search for missing data

Tags: r, data.table
This is from the examples in the data.table introduction. See http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf

The examples go on to say that a binary search is faster than a vector scan and produces exactly the same result (see page 5). So here is my example:

library(data.table)
grpsize = ceiling(10000/26^2) 
DF <- data.frame(x=rep(LETTERS,each=26*grpsize), y=rep(letters,each=grpsize),v=runif(grpsize*26^2), stringsAsFactors=FALSE)
DT = data.table(DF)
setkey(DT,x,y)

DT[x=='R' & y=='h']
DT[J("R","h")]

As expected, these return exactly the same result. The first scans every row; the second is a binary search. However, when the requested rows do not exist, the results differ. See the following code:

DT[x=='R' & y=='H']
DT[J("R","H")]

I get the following results:

# > DT[x=='R' & y=='H', ]
# Empty data.table (0 rows) of 3 cols: x,y,v

# > DT[J("R","H")]
#    x  y  v
# 1: R H NA

a.) Why is this the case?

b.) Is there a way to change the behaviour of the binary search so that it does not return rows for non-existent keys?

Wolfgang Wu asked Aug 20 '13 15:08
1 Answer

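I guess J is more than just a binary search; it's a "join." For each key combination it is given, it has to return something, so by default (nomatch=NA) a key with no match comes back as a row with NA in the non-key columns. To turn that off: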

DT[J('R','H'),nomatch=0]
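
To make the difference concrete, here is a minimal sketch comparing the default join behaviour with nomatch=0. The three-row table is toy data assumed for illustration, not the DT built in the question, and the names with_na and no_na are hypothetical.

```r
library(data.table)

# Toy keyed table standing in for DT from the question (assumed data)
DT <- data.table(x = c("R", "R", "S"),
                 y = c("g", "h", "h"),
                 v = c(0.1, 0.2, 0.3))
setkey(DT, x, y)

# Default (nomatch = NA): the unmatched key ("R", "H") still yields
# one row, with NA in the non-key column v
with_na <- DT[J("R", "H")]
print(with_na)

# nomatch = 0: unmatched keys are dropped, so the result is empty,
# which matches what the vector scan returns
no_na <- DT[J("R", "H"), nomatch = 0]
print(no_na)
print(DT[x == "R" & y == "H"])
```

For keys that do exist, such as J("R","h"), both forms should return the same rows, so nomatch=0 only changes what happens to the misses.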
Frank answered Nov 06 '22 03:11