I want to left join two dataframes, where there might be NA
s in the join column on both side (i.e. both code
columns)
a <- data.frame(code=c(1,2,NA))
b <- data.frame(code=c(1,2,NA, NA), name=LETTERS[1:4])
Using dplyr
, we get:
left_join(a, b, by="code")
code name
1 1 A
2 2 B
3 NA C
4 NA D
Using SQL, we get:
CREATE TABLE a (code INT);
INSERT INTO a VALUES (1),(2),(NULL);
CREATE TABLE b (code INT, name VARCHAR);
INSERT INTO b VALUES (1, 'A'),(2, 'B'),(NULL, 'C'), (NULL, 'D');
SELECT * FROM a LEFT JOIN b USING (code);
It seems that dplyr
joins do not treat NA
s like SQL NULL
values.
dplyr
to behave in the same way as SQL?PS. Of course, I could remove NAs first to get there left_join(a, na.omit(b), by="code")
, but that is not my question.
In SQL, "null" matches nothing, because SQL has no information on what it should join to -- hence the resulting "null"s in your joined data set, just as it would appear if performing left outer joins without a match in the right data set.
In R however, the default behaviour for "NA" when it comes to joins is almost to treat it like a data point (e.g. a null operator), so "NA" would match "NA". For example,
> match(NA, NA)
[1] 1
One way you can circumvent this would be to use the base merge method,
> merge(a, b, by="code", all.x=TRUE, incomparables=NA)
code name
1 1 A
2 2 B
3 NA <NA>
The "incomparables" parameter here allows you to define values that cannot be matched, and essentially forces R to treat "NA" the way SQL treats "null". It doesn't look like the incomparables feature is implemented in left_join, but it may simply be named differently.
By default column code have primary key,therefore not accept NULL value
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With