Following this wikipedia article SQL join I wanted to have a clear view on how we could have joins with data.table. In the process we might have uncovered a bug when joining with NAs. Taking the wiki example:
R) X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),depID=c(31,33,33,34,34,NA),key="depID")
R) Y = data.table(depID=c(31,33,34,35),depName=c("Sal","Eng","Cle","Mar"),key="depID")
R) X
name depID
1: Joh NA
2: Raf 31
3: Jon 33
4: Ste 33
5: Rob 34
6: Smi 34
R) Y
depID depName
1: 31 Sal
2: 33 Eng
3: 34 Cle
4: 35 Mar
LEFT OUTER JOIN
R) merge.data.frame(X,Y,all.x=TRUE)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 NA Joh <NA>
merge.data.table
do not output the same result and show what I think is a bug on lign 2.
R) merge(X,Y,all.x=TRUE)
depID name depName
1: NA Joh Eng
2: 31 Raf NA
3: 33 Jon Eng
4: 33 Ste Eng
5: 34 Rob Cle
6: 34 Smi Cle
R) Y[X] #same -> :(
depID depName name
1: NA Eng Joh
2: 31 NA Raf
3: 33 Eng Jon
4: 33 Eng Ste
5: 34 Cle Rob
6: 34 Cle Smi
RIGHT OUTER JOIN Looks like the same
R) merge.data.frame(X,Y,all.y=TRUE)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 35 <NA> Mar
R) merge(X,Y,all.y=TRUE)
depID name depName
1: NA Joh Eng
2: 31 NA Sal
3: 33 Jon Eng
4: 33 Ste Eng
5: 34 Rob Cle
6: 34 Smi Cle
7: 35 NA Mar
INNER (NATURAL) JOIN
R) merge.data.frame(X,Y)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
R) merge(X,Y)
depID name depName
1: NA Joh Eng
2: 33 Jon Eng
3: 33 Ste Eng
4: 34 Rob Cle
5: 34 Smi Cle
Four types of joins: left, right, inner, and outer.
The plus sign is Oracle syntax for an outer join. There isn't a minus operator for joins. An outer join means return all rows from one table. Also return the rows from the outer joined where there's a match on the join key. If there's no matching row, return null.
The biggest difference between an INNER JOIN and an OUTER JOIN is that the inner join will keep only the information from both tables that's related to each other (in the resulting table). An Outer Join, on the other hand, will also keep information that is not related to the other table in the resulting table.
There are three types of outer joins: left outer join, right outer join, and full outer join.
Following up on comments in other answer, yes, here is the proof that it only affects type double
columns (NA in integer
and character
columns are ok).
X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),
depID=as.integer(c(31,33,33,34,34,NA)),key="depID")
Y = data.table(depID=as.integer(c(31,33,34,35)),
depName=c("Sal","Eng","Cle","Mar"),key="depID")
Y[X]
depID depName name
1: NA NA Joh
2: 31 Sal Raf
3: 33 Eng Jon
4: 33 Eng Ste
5: 34 Cle Rob
6: 34 Cle Smi
merge.data.frame(X,Y,all.x=T)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 NA Joh <NA>
Y = data.table(depID=as.character(c(31,33,34,35)),
depName=c("Sal","Eng","Cle","Mar"),key="depID")
X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),
depID=as.character(c(31,33,33,34,34,NA)),key="depID")
X
name depID
1: Raf 31
2: Jon 33
3: Ste 33
4: Rob 34
5: Smi 34
6: Joh NA
Y
depID depName
1: 31 Sal
2: 33 Eng
3: 34 Cle
4: 35 Mar
str(X)
Classes ‘data.table’ and 'data.frame': 6 obs. of 2 variables:
$ name : chr "Raf" "Jon" "Ste" "Rob" ...
$ depID: chr "31" "33" "33" "34" ...
- attr(*, "sorted")= chr "depID"
- attr(*, ".internal.selfref")=<externalptr>
merge.data.frame(X,Y,all.x=T)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 <NA> Joh <NA>
Y[X]
depID depName name
1: 31 Sal Raf
2: 33 Eng Jon
3: 33 Eng Ste
4: 34 Cle Rob
5: 34 Cle Smi
6: NA NA Joh
Yes it looks like an (embarassing) new bug related to the NA in key. There have been other discussions about NA in key not being possible but I didn't realise it could mess up in that way. Will investigate. Thanks ...
#2453 NA in double key column messes up joins (NA in integer and character ok)
Now fixed in 1.8.7 (commit 780), from NEWS :
NA in a join column of type double could cause both X[Y] and merge(X,Y) to return incorrect results, #2453. Due to an errant x==NA_REAL in the C source which should have been ISNA(x). Support for double in keyed joins is a relatively recent addition to data.table, but embarassing all the same. Fixed and tests added. Many thanks to statquant for the thorough and reproducible report.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With