Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Incorrect behavior with dplyr's left_join?

Surely this is not intended? Is this something that happens in other parts of dplyr's functionality and should I be concerned? I love the performance and hate data.table syntax. Is there an alternative to dplyr and data.table that is currently safe to use and still high performance?

A <- structure(list(ORDER = c(30305720L, 30334659L, 30379936L, 
                     30406397L, 30407697L, 30431950L), 
                    COST = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99")), 
               .Names = c("ORDER", "COST"), 
               row.names = c(NA, 6L), 
               class = "data.frame")

B <- structure(list(ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950), 
                    AREA = c(0, 2339, 2162, 23040, 475466)), 
               .Names = c("ORDER", "AREA"), 
               row.names = c(4L, 8L, 11L, 12L, 15L), 
               class = c("tbl_df", "tbl", "data.frame"))

Garbage results:

left_join(A, B)
  ORDER    COST                 AREA
1 30305720                  0   NA
2 30334659                      NA
3 30379936           11430.52   NA
4 30406397 20196.279999999999   NA
5 30407697                  0   NA
6 30431950           10445.99   NA

Effective results:

merge(A, B, all.x=T, all.y=F)
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466
like image 923
stanekam Avatar asked Sep 16 '14 01:09

stanekam


2 Answers

I posted something similar the other day. I think what you need to do is to have ORDER as numeric (or possibly the other way around). A has ORDER has integer. But B has ORDER as numeric. At the moment, dplyr asks you to have group-by variables in the same class. I received a comment from an SO user saying that this is something Hadley and his team has been working on now. This issue will be fixed in the future.

A$ORDER <- as.numeric(A$ORDER)
left_join(A,B, by = "ORDER")

     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466

UPDATE After exchanging comments with thelatemail, I decided to add more observations here.

CASE 1: Treat ORDER as numeric

A$ORDER <- as.numeric(A$ORDER)

> left_join(A,B, by = "ORDER")
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466

> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]

     ORDER   AREA               COST
1 30334659      0                   
2 30379936   2339           11430.52
3 30406397   2162 20196.279999999999
4 30407697  23040                  0
5 30431950 475466           10445.99

If you have ORDER as integer in both A and B, that works too.

CASE 2: Treat ORDER as integer and numeric

> left_join(A,B, by = "ORDER")
     ORDER               COST AREA
1 30305720                  0   NA
2 30334659                      NA
3 30379936           11430.52   NA
4 30406397 20196.279999999999   NA
5 30407697                  0   NA
6 30431950           10445.99   NA

> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]

     ORDER   AREA               COST
1 30334659      0                   
2 30379936   2339           11430.52
3 30406397   2162 20196.279999999999
4 30407697  23040                  0
5 30431950 475466           10445.99

As suggested by thelatemail, integer/numeric combination does not work. But numeric/integer combination works.

Given these observations, it is safe to be consistent in group-by variable at the moment. Alternatively, merge() is the way to go. It can handle integer and numeric.

> merge(A,B, by = "ORDER", all = TRUE)
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466 

> merge(B,A, by = "ORDER", all = TRUE)
     ORDER   AREA               COST
1 30305720     NA                  0
2 30334659      0                   
3 30379936   2339           11430.52
4 30406397   2162 20196.279999999999
5 30407697  23040                  0
6 30431950 475466           10445.99

UPDATE2 (as of the 8th of November, 2014)

I am using a dev version of dplyr(dplyr_0.3.0.9000), which you can download from Github. The issue above is now solved.

left_join(A,B, by = "ORDER")
#     ORDER               COST   AREA
#1 30305720                  0     NA
#2 30334659                         0
#3 30379936           11430.52   2339
#4 30406397 20196.279999999999   2162
#5 30407697                  0  23040
#6 30431950           10445.99 475466
like image 158
jazzurro Avatar answered Sep 27 '22 22:09

jazzurro


From the dplyr documentation:

left_join()

returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

semi_join()

returns all rows from x where there are matching values in y, keeping just columns from x.

A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

Is semi_join() a valuable option for you?

like image 43
paoloeusebi Avatar answered Sep 27 '22 23:09

paoloeusebi