I'd like to merge two data frames by `id`, but they both have two of the same columns; therefore, when I merge I get new `.x` and `.y` columns. How can I merge these two data frames with `left_join()` and drop the extra columns my code currently produces (`element.x`, `day.x`, `element.y`, and `day.y`), keeping a single `element` and `day` column?
Code:
library(dplyr)

# Sample data
df1 <- data.frame(id = seq(1,5), value1 = rnorm(5), element = "TEST1", day = 15)
df2 <- data.frame(id = seq(1,5), value2 = rnorm(5), element = "TEST1", day = 15)
# Merge
df <- left_join(df1, df2, by = "id")
# Output
id value1 element.x day.x value2 element.y day.y
1 1 -0.69700149 TEST1 15 1.4324220 TEST1 15
2 2 -0.25514949 TEST1 15 0.7281354 TEST1 15
3 3 0.09206902 TEST1 15 0.8148839 TEST1 15
4 4 2.51799237 TEST1 15 1.3919671 TEST1 15
5 5 -0.77049050 TEST1 15 -0.2707201 TEST1 15
A common way to remove repeated column names from a data frame is the duplicated() function. Applied to the result of colnames(), it flags every column name that has already appeared earlier; negating that result inside square brackets keeps only the first occurrence of each name. Note that this only helps when the names are literally identical: left_join() renames clashing columns to .x and .y, so duplicated() will not flag them.
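A minimal base R sketch of the duplicated() approach, using made-up data and cbind() so that the column names collide exactly. The data frames and variable names here are illustrative assumptions, and the approach compares names only, not the values inside the columns.

```r
# Hypothetical example data: both frames share "id", "element" and "day"
df1 <- data.frame(id = 1:3, value1 = c(0.1, 0.2, 0.3), element = "TEST1", day = 15)
df2 <- data.frame(id = 1:3, value2 = c(1.1, 1.2, 1.3), element = "TEST1", day = 15)

# cbind() leaves the names untouched, so the shared names appear twice
combined <- cbind(df1, df2)

# duplicated() marks the second and later occurrences of each name
is_dup <- duplicated(colnames(combined))

# Keep only the first occurrence of every column name
deduped <- combined[, !is_dup]
colnames(deduped)
```

Because only the first copy of each name survives, this is safe only when the duplicated columns really hold the same values.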
Method 1 (PySpark): using drop(). Join the DataFrames, for example with an inner join, and then call drop() on the result to remove one copy of the duplicated column.
Removing duplicate columns after a join in PySpark: name the duplicated column explicitly in the join condition. The join then keeps both copies, and drop() removes the one you do not want.
In Excel's Power Query Editor, to remove a single column, select the column and then choose Home > Remove Columns > Remove Columns.
Just drop everything you don't need from `df2` before joining. In this case, keep only the `id` and `value2` columns:
left_join(df1, select(df2, c(id,value2)), by = "id")
# id value1 element day value2
#1 1 1.2276303 TEST1 15 -0.1389861
#2 2 -0.8017795 TEST1 15 -0.5973131
#3 3 -1.0803926 TEST1 15 -2.1839668
#4 4 -0.1575344 TEST1 15 0.2408173
#5 5 -1.0717600 TEST1 15 -0.2593554
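When `df2` has many columns, listing the ones to keep by hand gets tedious. The following is a sketch of one way to compute them, assuming the key is a single column; `drop_shared_cols()` is a hypothetical helper, not a dplyr function, and base R `merge()` stands in for the join so the snippet runs without extra packages.

```r
# Hypothetical helper: keep the key plus the columns of `right`
# that `left` does not already have
drop_shared_cols <- function(left, right, key) {
  new_cols <- setdiff(names(right), names(left))
  right[, c(key, new_cols), drop = FALSE]
}

df1 <- data.frame(id = 1:5, value1 = 1:5 / 10, element = "TEST1", day = 15)
df2 <- data.frame(id = 1:5, value2 = 5:1 / 10, element = "TEST1", day = 15)

df2_trimmed <- drop_shared_cols(df1, df2, key = "id")

# Base R left join (all.x = TRUE keeps every row of df1)
result <- merge(df1, df2_trimmed, by = "id", all.x = TRUE)
names(result)
```

With dplyr loaded, `left_join(df1, df2_trimmed, by = "id")` should accept the same trimmed data frame.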
Beware that these answers are not all equivalent; ask yourself what result you actually need. For example:
df1 <- data.frame(id=1:3,day=2:4,element=3:5,value1=100:102)
df2 <- data.frame(id=1:3,day=3:5,element=4:6,value2=200:202)
df1
# id day element value1
#1 1 2 3 100
#2 2 3 4 101
#3 3 4 5 102
df2
# id day element value2
#1 1 3 4 200
#2 2 4 5 201
#3 3 5 6 202
left_join(df1, df2)
#Joining by: c("id", "day", "element")
# id day element value1 value2
#1 1 2 3 100 NA
#2 2 3 4 101 NA
#3 3 4 5 102 NA
left_join(df1, select(df2, c(id,value2)), by = "id")
# id day element value1 value2
#1 1 2 3 100 200
#2 2 3 4 101 201
#3 3 4 5 102 202
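The two calls differ because joining on every shared column only matches rows where all of those columns agree, while joining on `id` alone silently discards `df2`'s `day` and `element`. Before dropping columns, it can help to check whether the shared columns really hold the same values. A rough sketch in base R, assuming `id` is unique in `df2` (the variable names `key`, `shared`, `idx`, and `agrees` are illustrative):

```r
df1 <- data.frame(id = 1:3, day = 2:4, element = 3:5, value1 = 100:102)
df2 <- data.frame(id = 1:3, day = 3:5, element = 4:6, value2 = 200:202)

key <- "id"

# Non-key columns present in both data frames
shared <- setdiff(intersect(names(df1), names(df2)), key)

# Align df2's rows to df1's ids, then compare each shared column
idx <- match(df1[[key]], df2[[key]])
agrees <- vapply(
  shared,
  function(col) isTRUE(all.equal(df1[[col]], df2[[col]][idx])),
  logical(1)
)
agrees
```

Any `FALSE` entry means the two data frames disagree on that column, so dropping `df2`'s copy would lose information.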