I have over 20 twenty data.frames with the same columns but differing amount of rows. My goal is to merge the data.frames by the column "Name" (which is a list of five names) and while merging I would like the rows with the same name to sum column A, sum column B, and get the mean of column C.
Here is what I am currently doing.
First I will just merge 2 data.frames at a time.
DF <- merge(x=abc, y=def, by = "Name", all = T)
Merged DF will look like such
Name A.x B.x C.x A.y B.y C.y
name1,name2,name3,name4,name5 11 24 7 NA NA NA
name1,name3,name4,name6,name7 4 8 12 3 4 7
name1,name2,name5,name6,name7 12 4 5 NA NA NA
name3,name4,name5,name6,name7 NA NA NA 15 3 28
I will then add these ifelse
statements to deal with the NAs
and non unique rows. For the non unique rows it will add for A add for B and for C it will get an average.
DF$A <- ifelse(is.na(DF$A.x), DF$A.y,
ifelse(is.na(DF$A.y), DF$A.x,
ifelse((!is.na(DF$A.x)) & (!is.na(DF$A.y)), DF$A.x + DF$A.y, 1)))
DF$B <- ifelse(is.na(DF$B.x), DF$B.y,
ifelse(is.na(DF$B.y), DF$B.x,
ifelse((!is.na(DF$B.x)) & (!is.na(DF$B.y)), DF$B.x + DF$B.y, 1)))
DF$C <- ifelse(is.na(DF$C.x), DF$C.y,
ifelse(is.na(DF$C.y), DF$C.x,
ifelse((!is.na(DF$C.x)) & (!is.na(DF$C.y)), (DF$C.x + DF$C.y)/2, 1)))
DF will now look like such
Name A.x B.x C.x A.y B.y C.y A B C
name1,name2,name3,name4,name5 11 24 7 NA NA NA 11 24 7
name1,name3,name4,name6,name7 4 8 12 3 4 8 7 12 10
name1,name2,name5,name6,name7 12 4 5 NA NA NA 12 4 5
name3,name4,name5,name6,name7 NA NA NA 15 3 28 15 3 28
I then keep just the Name column and the last three columns
merge1 <- DF[c(1,8,9,10)]
Then I do the same process for the next two data.frames and call it merge2. Then I will merge merge1 and merge 2.
total1 <- merge(x = merge1, y = merge2, by = "Name", all = TRUE)
I will just continue to merge two data frames at a time then merge the Totals data.frames together as well two at a time. I get my end result that I want but it is a timely process and not very efficient.
Another way I think I could do it is may be do a rbind with all the data.frames then if in the Name column any row has the same list of names as another row then make that one row, add column A, add column B and get the mean of column C. But I am not sure how to do that as well.
Here is an example of what I would like with rind
Name A B C
name1,name2,name3,name4,name5 11 24 7
name1,name3,name4,name6,name7 4 8 12
name1,name2,name5,name6,name7 12 4 5
name3,name4,name5,name6,name7 15 3 28
name1,name3,name4,name6,name7 3 4 8
The end result would look like such
Name A B C
name1,name2,name3,name4,name5 11 24 7
name1,name3,name4,name6,name7 7 12 10
name1,name2,name5,name6,name7 12 4 5
name3,name4,name5,name6,name7 15 3 28
Again, I am sure there are more efficient ways to complete what I want than what I am currently doing so any help would be greatly appreciated.
I think your second approach is the way to go, and you can do that with data.table
or dplyr
.
Here a few steps using data.table
. First, if your data frames are abc
, def
, ...
do:
DF <- do.call(rbind, list(abc,def,...))
now you can transform them into a data.table
DT <- data.table(DF)
and simply do something like
DTres <- DT[,.(A=sum(A, na.rm=T), B=sum(B, na.rm=T), C=mean(C,na.rm=T)),by=name]
double check the data.table
vignettes to get a better idea how that package work.
We can use dplyr
library(dplyr)
bind_rows(abc, def, ...) %>%
group_by(name) %>%
summarise(A= sum(A, na.rm= TRUE),
B = sum(B, na.rm= TRUE),
C = mean(C, na.rm=TRUE))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With