Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge in R results in more rows than one of the data frames

I have two data frames, the first contains 9994 rows and the second contains 60431 rows. I want to merge the two data frames such that the merged data frame contains combined columns of both data frames but only contains 9994 rows.

However, I get more than 9994 rows upon merge. How can I make sure this does not happen?

df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994

df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431

df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057

df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057

EDIT : Following akrun's comment. Yes, there were duplicates in the second data frame

nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431

How can I take only one row from a data frame if there are multiple for the same {col1,col2} combination. When I merge, I would like to have only 9994 rows.

like image 410
tubby Avatar asked May 23 '15 08:05

tubby


People also ask

Why does merge Increase Number of rows?

The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.

How do I merge two Dataframes with different number of rows in R?

Use the left_join Function to Merge Two R Data Frames With Different Number of Rows. left_join is another method from the dplyr package. It takes arguments similar to the full_join function, but left_join extracts all rows from the first data frame and all columns from both of them.

How do I merge data frames by rows in R?

To merge two data frames (datasets) horizontally, use the merge() function in the R language. To bind or combine rows in R, use the rbind() function. The rbind() stands for row binding.

What does the merge function do in R?

The R merge function allows merging two data frames by common columns or by row names. This function allows you to perform different database (SQL) joins, like left join, inner join, right join or full join, among others.


1 Answers

This should work, be sure to sort df2 first so you select the right rows.

df = merge(
  df1,
  df2[!duplicated(df2[, c("col1","col2")], ],
  by=c("col1","col2"),
  all.x=TRUE
)

What happens here: I merge the two data frames by the columns we want to merge by, but I first select only the first occurrence of any combination of col1 and col2 from the second data.frame df2.

duplicated checks if lines are duplicated if called with a data.frame. I select col1 and col2 from df2, so duplicated returns TRUE for rows with the same col1 and col2 but differences in other cols. Then I select only the rows which are not duplicated.

(Read the [-expressions carefully, and check the function calls from the inside out, to get the intermediate results)

edit: added explanation as suggested in comments

like image 85
snaut Avatar answered Sep 28 '22 11:09

snaut