Locate and merge duplicate rows in a data.frame but ignore column order

Tags: dataframe, r, duplicates, plyr

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.

Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):

   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15

However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.

asked Jun 09 '12 06:06

jdfinch3

1 Answer

Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
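To see why the "sorted paste" works, here is a minimal sketch using rows 1 and 5 from the question. The helper name make_key is only an illustration and is not part of the original answer.

```r
# Two rows from the question that differ only in column order
row1 <- c("Bob", "Fred", "Sam")
row5 <- c("Fred", "Bob", "Sam")

# Hypothetical helper: sort the names, then join them with "~"
make_key <- function(x) paste(sort(x), collapse = "~")

key1 <- make_key(row1)
key5 <- make_key(row5)

# Both rows collapse to the same key, so they can be grouped together
identical(key1, key5)
```

The separator only has to be a character that never appears inside a name; "~" is assumed to be safe for this data.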

Brief code snippet (assumes the data frame shown in the question is dd). We create a lookup column holding each row's names sorted and pasted together, sum the total column for each lookup value, and then keep one row per unique combination:

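# Order-independent key: sort each row's names, then paste them together
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))
# Sum the totals within each key
tab1 <- tapply(dd$total, dd$lookup, sum)
# Keep the first row seen for each key and attach its summed total
ee <- dd[match(unique(dd$lookup), dd$lookup), ]
ee$newtotal <- as.numeric(tab1)[match(ee$lookup, names(tab1))]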

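ee now holds one row per unique combination of names, with the combined count in newtotal. The name columns keep the order of the first row seen for each combination, and the old total column is still present. This uses only base R, and each intermediate object (lookup, tab1, ee) can be inspected along the way.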
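As a check, the steps can be run end to end on the five example rows from the question. This is a sketch that rebuilds dd by hand from the table in the question; the values in the comments follow from that sample data only.

```r
# Rebuild the example from the question (dd is the already-summarised frame)
dd <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE
)

# Order-independent key for each row
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))

# Summed totals per key, then one representative row per key
tab1 <- tapply(dd$total, dd$lookup, sum)
ee <- dd[match(unique(dd$lookup), dd$lookup), ]
ee$newtotal <- as.numeric(tab1)[match(ee$lookup, names(tab1))]

# Rows 1 and 5 merge (30 + 15 = 45); rows 3 and 4 merge (25 + 10 = 35)
print(ee[, c("name1", "name2", "name3", "newtotal")])
```

Three rows remain: Bob/Fred/Sam with 45, Bob/Joe/Frank with 20, and Frank/Sam/Tom with 35.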

(Minor update to help OP:) And if you want a cleaned-up version of the final answer:

outdf = with(ee,data.frame(name1,name2,name3,
                           total=newtotal,stringsAsFactors=FALSE))

This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
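The question mentions that the original data.frame with all of the duplicates is still available. If starting from there instead, the same sorted key can be counted directly with table(). The sketch below assumes a hypothetical un-aggregated frame called raw with one row per observation and no total column; the rows are made up for illustration.

```r
# Hypothetical raw data: one row per observation, duplicates not yet combined
raw <- data.frame(
  name1 = c("Bob", "Fred", "Sam", "Frank", "Bob"),
  name2 = c("Fred", "Bob", "Tom", "Sam", "Joe"),
  name3 = c("Sam", "Sam", "Frank", "Tom", "Frank"),
  stringsAsFactors = FALSE
)

# Same order-independent key as above
key <- apply(raw[, c("name1", "name2", "name3")], 1,
             function(x) paste(sort(x), collapse = "~"))

# Count how many rows share each key
counts <- as.data.frame(table(key), stringsAsFactors = FALSE)
names(counts) <- c("combination", "total")
print(counts)
```

This skips the intermediate plyr step, at the cost of returning the combination as a single pasted string rather than three name columns.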

answered Sep 28 '22 01:09

Tim P
