Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.table inner/outer join with NA in join column of type double bug?

Following this wikipedia article SQL join I wanted to have a clear view on how we could have joins with data.table. In the process we might have uncovered a bug when joining with NAs. Taking the wiki example:

R) X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),depID=c(31,33,33,34,34,NA),key="depID")
R) Y = data.table(depID=c(31,33,34,35),depName=c("Sal","Eng","Cle","Mar"),key="depID")
R) X
   name depID
1:  Joh    NA
2:  Raf    31
3:  Jon    33
4:  Ste    33
5:  Rob    34
6:  Smi    34
R) Y
   depID depName
1:    31     Sal
2:    33     Eng
3:    34     Cle
4:    35     Mar

LEFT OUTER JOIN

R) merge.data.frame(X,Y,all.x=TRUE)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
6    NA  Joh    <NA>

merge.data.table do not output the same result and show what I think is a bug on lign 2.

R) merge(X,Y,all.x=TRUE)
   depID name depName
1:    NA  Joh     Eng
2:    31  Raf      NA
3:    33  Jon     Eng
4:    33  Ste     Eng
5:    34  Rob     Cle
6:    34  Smi     Cle
R) Y[X] #same -> :(
   depID depName name
1:    NA     Eng  Joh
2:    31      NA  Raf
3:    33     Eng  Jon
4:    33     Eng  Ste
5:    34     Cle  Rob
6:    34     Cle  Smi

RIGHT OUTER JOIN Looks like the same

R) merge.data.frame(X,Y,all.y=TRUE)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
6    35 <NA>     Mar

R) merge(X,Y,all.y=TRUE)
   depID name depName
1:    NA  Joh     Eng
2:    31   NA     Sal
3:    33  Jon     Eng
4:    33  Ste     Eng
5:    34  Rob     Cle 
6:    34  Smi     Cle
7:    35   NA     Mar

INNER (NATURAL) JOIN

R) merge.data.frame(X,Y)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
R) merge(X,Y)
   depID name depName
1:    NA  Joh     Eng
2:    33  Jon     Eng
3:    33  Ste     Eng
4:    34  Rob     Cle
5:    34  Smi     Cle
like image 423
statquant Avatar asked Dec 28 '12 21:12

statquant


People also ask

What are the 4 join types?

Four types of joins: left, right, inner, and outer.

What does (+) mean in SQL joins?

The plus sign is Oracle syntax for an outer join. There isn't a minus operator for joins. An outer join means return all rows from one table. Also return the rows from the outer joined where there's a match on the join key. If there's no matching row, return null.

What is join discuss inner and outer join with example?

The biggest difference between an INNER JOIN and an OUTER JOIN is that the inner join will keep only the information from both tables that's related to each other (in the resulting table). An Outer Join, on the other hand, will also keep information that is not related to the other table in the resulting table.

What are the three types of outer join?

There are three types of outer joins: left outer join, right outer join, and full outer join.


2 Answers

Following up on comments in other answer, yes, here is the proof that it only affects type double columns (NA in integer and character columns are ok).

X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),
               depID=as.integer(c(31,33,33,34,34,NA)),key="depID")
Y = data.table(depID=as.integer(c(31,33,34,35)),
               depName=c("Sal","Eng","Cle","Mar"),key="depID")
Y[X]
   depID depName name
1:    NA      NA  Joh
2:    31     Sal  Raf
3:    33     Eng  Jon
4:    33     Eng  Ste
5:    34     Cle  Rob
6:    34     Cle  Smi

merge.data.frame(X,Y,all.x=T)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
6    NA  Joh    <NA>

Y = data.table(depID=as.character(c(31,33,34,35)),
               depName=c("Sal","Eng","Cle","Mar"),key="depID")
X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),
               depID=as.character(c(31,33,33,34,34,NA)),key="depID")
X
   name depID
1:  Raf    31
2:  Jon    33
3:  Ste    33
4:  Rob    34
5:  Smi    34
6:  Joh    NA
Y
   depID depName
1:    31     Sal
2:    33     Eng
3:    34     Cle
4:    35     Mar
str(X)
Classes ‘data.table’ and 'data.frame':  6 obs. of  2 variables:
 $ name : chr  "Raf" "Jon" "Ste" "Rob" ...
 $ depID: chr  "31" "33" "33" "34" ...
 - attr(*, "sorted")= chr "depID"
 - attr(*, ".internal.selfref")=<externalptr> 

merge.data.frame(X,Y,all.x=T)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
6  <NA>  Joh    <NA>

Y[X]
   depID depName name
1:    31     Sal  Raf
2:    33     Eng  Jon
3:    33     Eng  Ste
4:    34     Cle  Rob
5:    34     Cle  Smi
6:    NA      NA  Joh

THE PROBLEM HAS BEEN FIXED BY MATTHEW DOWLE IN V.1.8.7

like image 132
statquant Avatar answered Oct 13 '22 02:10

statquant


Yes it looks like an (embarassing) new bug related to the NA in key. There have been other discussions about NA in key not being possible but I didn't realise it could mess up in that way. Will investigate. Thanks ...

#2453 NA in double key column messes up joins (NA in integer and character ok)

Now fixed in 1.8.7 (commit 780), from NEWS :

NA in a join column of type double could cause both X[Y] and merge(X,Y) to return incorrect results, #2453. Due to an errant x==NA_REAL in the C source which should have been ISNA(x). Support for double in keyed joins is a relatively recent addition to data.table, but embarassing all the same. Fixed and tests added. Many thanks to statquant for the thorough and reproducible report.

like image 28
Matt Dowle Avatar answered Oct 13 '22 00:10

Matt Dowle