merge in R results in more rows than one of the data frames

I have two data frames: the first contains 9994 rows and the second contains 60431 rows. I want to merge them so that the result contains the combined columns of both data frames but only 9994 rows.

However, I get more than 9994 rows upon merge. How can I make sure this does not happen?

df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994

df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431

df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057

df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057

EDIT: Following akrun's comment: yes, there were duplicates in the second data frame.

nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431

How can I take only one row from the second data frame when there are multiple rows for the same {col1,col2} combination? After the merge, I would like to have only 9994 rows.

asked May 23 '15 08:05

tubby

1 Answer

This should work. Be sure to sort df2 first, so that the row you want to keep is the first one for each {col1,col2} combination.

df = merge(
  df1,
  df2[!duplicated(df2[, c("col1","col2")]), ],
  by=c("col1","col2"),
  all.x=TRUE
)
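
The sorting step mentioned above could look roughly like this. It is only a sketch on toy data: the column `value` and the rule "keep the row with the largest value" are assumptions for illustration, since the question does not say which of the duplicate rows should win.

```r
# Toy stand-in for df2; `value` is a hypothetical column used to rank rows.
df2 <- data.frame(
  col1  = c("a", "a", "b", "b", "c"),
  col2  = c(1, 1, 2, 2, 3),
  value = c(10, 30, 5, 2, 7)
)

# Sort so that the preferred row (assumed here: largest `value`)
# comes first within each (col1, col2) combination.
df2_sorted <- df2[order(df2$col1, df2$col2, -df2$value), ]

# Keep only the first row of each (col1, col2) combination.
df2_unique <- df2_sorted[!duplicated(df2_sorted[, c("col1", "col2")]), ]
```

With a different rule (for example, the most recent date) only the arguments to order() would change.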

What happens here: I merge the two data frames by col1 and col2, but before merging I keep only the first occurrence of each combination of col1 and col2 in the second data frame, df2.

When called on a data.frame, duplicated checks for each row whether it repeats an earlier row. Because I pass only col1 and col2 from df2, it returns TRUE for the second and later rows with the same col1 and col2, even if they differ in other columns. Negating the result with ! then selects only the rows that are not duplicates.
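
To make this concrete, here is a small illustration on made-up data (the data frames below are toy stand-ins, not the ones from the question). It shows the logical vector returned by duplicated and how the row count of the merge changes.

```r
# Toy stand-ins for df1 and df2.
df1 <- data.frame(col1 = c("a", "b"), col2 = c(1, 2), x = c(100, 200))
df2 <- data.frame(
  col1 = c("a", "a", "b"),
  col2 = c(1, 1, 2),
  y    = c("first", "second", "only")
)

# TRUE marks the second and later occurrences of a key combination.
dup <- duplicated(df2[, c("col1", "col2")])
dup
# [1] FALSE  TRUE FALSE

# Without de-duplication, the repeated key multiplies the matching row of df1.
nrow(merge(df1, df2, by = c("col1", "col2")))
# [1] 3

# With de-duplication, every row of df1 matches at most one row of df2.
merged <- merge(df1, df2[!dup, ], by = c("col1", "col2"), all.x = TRUE)
nrow(merged)
# [1] 2
```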

(Read the [ expressions carefully, and evaluate the function calls from the inside out to see the intermediate results.)
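
One way to do that inside-out reading is to store each intermediate result under its own name. This is a sketch on toy data; the variable names keys, dup, shared and deduped are made up for illustration.

```r
# Toy stand-in for df2 (the real data is not shown in the question).
df2 <- data.frame(
  col1 = c("a", "a", "b", "c", "c"),
  col2 = c(1, 1, 2, 3, 3),
  y    = 1:5
)

keys <- df2[, c("col1", "col2")]  # innermost step: the key columns only
dup  <- duplicated(keys)          # next: flag repeated key combinations
sum(dup)                          # number of rows that would be dropped

# Show every row that shares its key with another row, including the first
# occurrence, to help decide which one should be kept.
shared <- dup | duplicated(keys, fromLast = TRUE)
df2[shared, ]

deduped <- df2[!dup, ]            # outermost step: keep first occurrences
```

Looking at df2[shared, ] before dropping anything makes it easier to see whether the first occurrence really is the row you want.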

edit: added explanation as suggested in comments

answered Sep 28 '22 11:09

snaut

Tags: merge, dataframe, r, rstudio
