
Join creates .x and .y column but they have identical content - why?

Tags: r, dplyr

I am confused about the output I am getting from a join command in dplyr like this:

d <- d1 %>% left_join(d2, by="someColumn")

The resulting df is what I expected, except that a column "someOtherColumn" is present as "someOtherColumn.x" and "someOtherColumn.y". I know that this is expected if the value in someColumn (on which I join) has different values in d1 and d2, but when I try to look at the affected rows with this:

d %>% filter( someOtherColumn.x != someOtherColumn.y)

I get no rows. How is that possible? Why might dplyr create the .x/.y columns for a column that has identical values in the two dataframes in the given join? It hasn't created .x/.y for any of the other columns.

Apologies for not showing any data. I can't share the actual data and I can't mock up a dataset that reproduces the error because I don't know what causes it.

tospo asked Sep 15 '25 14:09


1 Answer

Let's take this simple example.

library(dplyr)
set.seed(123)
df1 <- data.frame(a = 1:4, b = 1:4, c = rnorm(4))
df2 <- data.frame(a = 4:1, b = 4:1, c = rnorm(4))
df1

#  a b           c
#1 1 1 -0.56047565
#2 2 2 -0.23017749
#3 3 3  1.55870831
#4 4 4  0.07050839

df2
#  a b          c
#1 4 4  0.1292877
#2 3 3  1.7150650
#3 2 2  0.4609162
#4 1 1 -1.2650612

Notice that the values in columns a and b are the same in both data frames (although the row order is different).

When you join only by a, you get:

df1 %>% left_join(df2, by = 'a')
#  a b.x         c.x b.y        c.y
#1 1   1 -0.56047565   1 -1.2650612
#2 2   2 -0.23017749   2  0.4609162
#3 3   3  1.55870831   3  1.7150650
#4 4   4  0.07050839   4  0.1292877

You asked to join only by a, so only column a is used for matching. Every other column that exists in both data frames is kept from both sides and given a suffix, even if its values happen to be the same. Hence you get b.x and b.y as well as c.x and c.y.

If you do not want b.x and b.y to be generated because they hold the same values, include b in by:
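To see in advance which columns will pick up a suffix, you can compare the column names of the two data frames. This is only a rough sketch using the example data above; `by_cols` and `shared_non_keys` are names made up for illustration.

```r
library(dplyr)

set.seed(123)
df1 <- data.frame(a = 1:4, b = 1:4, c = rnorm(4))
df2 <- data.frame(a = 4:1, b = 4:1, c = rnorm(4))

by_cols <- "a"  # hypothetical variable holding the join key(s)

# Columns present in both inputs but not used as join keys.
# These are the ones that come back with .x / .y suffixes.
shared_non_keys <- setdiff(intersect(names(df1), names(df2)), by_cols)
shared_non_keys

joined <- left_join(df1, df2, by = by_cols)
names(joined)
```

The suffixing depends only on the column names, not on whether the values agree.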

df1 %>% left_join(df2, by = c('a', 'b'))

#  a b         c.x        c.y
#1 1 1 -0.56047565 -1.2650612
#2 2 2 -0.23017749  0.4609162
#3 3 3  1.55870831  1.7150650
#4 4 4  0.07050839  0.1292877

Now you get only c.x and c.y additional columns.
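If you would rather keep joining by a alone, a possible alternative is to check that the two copies really agree and drop the duplicate column from the right-hand data frame before the join. The sketch below reuses the example data; `na_check` and `deduped` are made-up names. One caveat worth keeping in mind when checking with filter(): it keeps only rows where the condition is TRUE, so a comparison involving NA (for instance, from an unmatched row in a left join) is silently dropped rather than reported as a difference.

```r
library(dplyr)

set.seed(123)
df1 <- data.frame(a = 1:4, b = 1:4, c = rnorm(4))
df2 <- data.frame(a = 4:1, b = 4:1, c = rnorm(4))

joined <- left_join(df1, df2, by = "a")

# identical() compares the whole vectors, including any NA positions
identical(joined$b.x, joined$b.y)

# filter() drops rows where the comparison evaluates to NA,
# so this returns zero rows even though y is missing in row 2
na_check <- data.frame(x = c(1, 2), y = c(1, NA))
nrow(filter(na_check, x != y))

# Drop the redundant column from df2 before joining, so no b.x / b.y appear
deduped <- df1 %>% left_join(select(df2, -b), by = "a")
names(deduped)
```

Whether to add b to by or to drop it from one side depends on whether b is really part of the key; if it is only a redundant copy, dropping it keeps the join condition simpler.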

Ronak Shah answered Sep 18 '25 05:09