Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference in NA/NULL treatment using dplyr::left_join (R lang) vs. SQL LEFT JOIN

I want to left join two dataframes, where there might be NAs in the join column on both side (i.e. both code columns)

a <- data.frame(code=c(1,2,NA))
b <- data.frame(code=c(1,2,NA, NA), name=LETTERS[1:4])

Using dplyr, we get:

left_join(a, b, by="code")

  code name
1    1    A
2    2    B
3   NA    C
4   NA    D

Using SQL, we get:

CREATE TABLE a (code INT);
INSERT INTO a VALUES (1),(2),(NULL);
CREATE TABLE b (code INT, name VARCHAR);
INSERT INTO b VALUES (1, 'A'),(2, 'B'),(NULL, 'C'), (NULL, 'D');
SELECT * FROM a LEFT JOIN b USING (code);

enter image description here

It seems that dplyr joins do not treat NAs like SQL NULL values.

  1. Is there a way to get dplyr to behave in the same way as SQL?
  2. What is rationale behind this type of NA treatment?

PS. Of course, I could remove NAs first to get there left_join(a, na.omit(b), by="code"), but that is not my question.

like image 643
Mark Heckmann Avatar asked Oct 18 '16 09:10

Mark Heckmann


2 Answers

In SQL, "null" matches nothing, because SQL has no information on what it should join to -- hence the resulting "null"s in your joined data set, just as it would appear if performing left outer joins without a match in the right data set.

In R however, the default behaviour for "NA" when it comes to joins is almost to treat it like a data point (e.g. a null operator), so "NA" would match "NA". For example,

> match(NA, NA)
[1] 1

One way you can circumvent this would be to use the base merge method,

> merge(a, b, by="code", all.x=TRUE, incomparables=NA)
  code name
1    1    A
2    2    B
3   NA <NA>

The "incomparables" parameter here allows you to define values that cannot be matched, and essentially forces R to treat "NA" the way SQL treats "null". It doesn't look like the incomparables feature is implemented in left_join, but it may simply be named differently.

like image 177
cissyc Avatar answered Oct 16 '22 22:10

cissyc


By default column code have primary key,therefore not accept NULL value

like image 1
Dayanand Jawalkar Avatar answered Oct 17 '22 00:10

Dayanand Jawalkar