Surely this is not intended? Is this something that happens in other parts of dplyr
's functionality and should I be concerned? I love the performance and hate data.table
syntax. Is there an alternative to dplyr
and data.table
that is currently safe to use and still high performance?
A <- structure(list(ORDER = c(30305720L, 30334659L, 30379936L,
30406397L, 30407697L, 30431950L),
COST = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99")),
.Names = c("ORDER", "COST"),
row.names = c(NA, 6L),
class = "data.frame")
B <- structure(list(ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950),
AREA = c(0, 2339, 2162, 23040, 475466)),
.Names = c("ORDER", "AREA"),
row.names = c(4L, 8L, 11L, 12L, 15L),
class = c("tbl_df", "tbl", "data.frame"))
Garbage results:
left_join(A, B)
ORDER COST AREA
1 30305720 0 NA
2 30334659 NA
3 30379936 11430.52 NA
4 30406397 20196.279999999999 NA
5 30407697 0 NA
6 30431950 10445.99 NA
Effective results:
merge(A, B, all.x=T, all.y=F)
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
I posted something similar the other day. I think what you need to do is to have ORDER as numeric (or possibly the other way around). A has ORDER has integer. But B has ORDER as numeric. At the moment, dplyr
asks you to have group-by variables in the same class. I received a comment from an SO user saying that this is something Hadley and his team has been working on now. This issue will be fixed in the future.
A$ORDER <- as.numeric(A$ORDER)
left_join(A,B, by = "ORDER")
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
UPDATE After exchanging comments with thelatemail, I decided to add more observations here.
CASE 1: Treat ORDER as numeric
A$ORDER <- as.numeric(A$ORDER)
> left_join(A,B, by = "ORDER")
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]
ORDER AREA COST
1 30334659 0
2 30379936 2339 11430.52
3 30406397 2162 20196.279999999999
4 30407697 23040 0
5 30431950 475466 10445.99
If you have ORDER as integer in both A and B, that works too.
CASE 2: Treat ORDER as integer and numeric
> left_join(A,B, by = "ORDER")
ORDER COST AREA
1 30305720 0 NA
2 30334659 NA
3 30379936 11430.52 NA
4 30406397 20196.279999999999 NA
5 30407697 0 NA
6 30431950 10445.99 NA
> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]
ORDER AREA COST
1 30334659 0
2 30379936 2339 11430.52
3 30406397 2162 20196.279999999999
4 30407697 23040 0
5 30431950 475466 10445.99
As suggested by thelatemail, integer/numeric combination does not work. But numeric/integer combination works.
Given these observations, it is safe to be consistent in group-by variable at the moment. Alternatively, merge()
is the way to go. It can handle integer and numeric.
> merge(A,B, by = "ORDER", all = TRUE)
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
> merge(B,A, by = "ORDER", all = TRUE)
ORDER AREA COST
1 30305720 NA 0
2 30334659 0
3 30379936 2339 11430.52
4 30406397 2162 20196.279999999999
5 30407697 23040 0
6 30431950 475466 10445.99
UPDATE2 (as of the 8th of November, 2014)
I am using a dev version of dplyr
(dplyr_0.3.0.9000), which you can download from Github.
The issue above is now solved.
left_join(A,B, by = "ORDER")
# ORDER COST AREA
#1 30305720 0 NA
#2 30334659 0
#3 30379936 11430.52 2339
#4 30406397 20196.279999999999 2162
#5 30407697 0 23040
#6 30431950 10445.99 475466
From the dplyr documentation:
left_join()
returns all rows from
x
, and all columns fromx
andy
. Rows inx
with no match iny
will haveNA
values in the new columns. If there are multiple matches betweenx
andy
, all combinations of the matches are returned.
semi_join()
returns all rows from
x
where there are matching values iny
, keeping just columns fromx
.A semi join differs from an inner join because an inner join will return one row of
x
for each matching row ofy
, where a semi join will never duplicate rows ofx
.
Is semi_join()
a valuable option for you?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With