Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to merge multiple data.frames and sum and average columns at the same time in R

I have over 20 twenty data.frames with the same columns but differing amount of rows. My goal is to merge the data.frames by the column "Name" (which is a list of five names) and while merging I would like the rows with the same name to sum column A, sum column B, and get the mean of column C.

Here is what I am currently doing.

First I will just merge 2 data.frames at a time.

DF <- merge(x=abc, y=def, by = "Name", all = T)

Merged DF will look like such

Name                              A.x    B.x   C.x   A.y   B.y  C.y
name1,name2,name3,name4,name5      11     24     7    NA    NA   NA
name1,name3,name4,name6,name7       4      8    12     3     4    7
name1,name2,name5,name6,name7      12      4     5    NA    NA   NA
name3,name4,name5,name6,name7      NA     NA    NA    15     3   28

I will then add these ifelse statements to deal with the NAs and non unique rows. For the non unique rows it will add for A add for B and for C it will get an average.

DF$A <- ifelse(is.na(DF$A.x), DF$A.y,
    ifelse(is.na(DF$A.y), DF$A.x,
        ifelse((!is.na(DF$A.x)) & (!is.na(DF$A.y)), DF$A.x + DF$A.y, 1)))
DF$B <- ifelse(is.na(DF$B.x), DF$B.y, 
    ifelse(is.na(DF$B.y), DF$B.x,
        ifelse((!is.na(DF$B.x)) & (!is.na(DF$B.y)), DF$B.x + DF$B.y, 1)))
DF$C <- ifelse(is.na(DF$C.x), DF$C.y, 
    ifelse(is.na(DF$C.y), DF$C.x,
        ifelse((!is.na(DF$C.x)) & (!is.na(DF$C.y)), (DF$C.x  + DF$C.y)/2, 1)))

DF will now look like such

Name                            A.x  B.x  C.x  A.y  B.y  C.y   A   B   C   
name1,name2,name3,name4,name5    11   24    7   NA   NA   NA  11  24   7
name1,name3,name4,name6,name7     4    8   12    3    4    8   7  12  10
name1,name2,name5,name6,name7    12    4    5   NA   NA   NA  12   4   5
name3,name4,name5,name6,name7    NA   NA   NA   15    3   28  15   3  28

I then keep just the Name column and the last three columns

merge1 <- DF[c(1,8,9,10)]

Then I do the same process for the next two data.frames and call it merge2. Then I will merge merge1 and merge 2.

total1 <- merge(x = merge1, y = merge2, by = "Name", all = TRUE)

I will just continue to merge two data frames at a time then merge the Totals data.frames together as well two at a time. I get my end result that I want but it is a timely process and not very efficient.

Another way I think I could do it is may be do a rbind with all the data.frames then if in the Name column any row has the same list of names as another row then make that one row, add column A, add column B and get the mean of column C. But I am not sure how to do that as well.

Here is an example of what I would like with rind

Name                              A     B     C
name1,name2,name3,name4,name5    11    24     7
name1,name3,name4,name6,name7     4     8    12
name1,name2,name5,name6,name7    12     4     5 
name3,name4,name5,name6,name7    15     3    28 
name1,name3,name4,name6,name7     3     4     8

The end result would look like such

Name                              A     B     C
name1,name2,name3,name4,name5    11    24     7
name1,name3,name4,name6,name7     7    12    10
name1,name2,name5,name6,name7    12     4     5 
name3,name4,name5,name6,name7    15     3    28 

Again, I am sure there are more efficient ways to complete what I want than what I am currently doing so any help would be greatly appreciated.

like image 717
Dre Avatar asked Jan 26 '16 05:01

Dre


2 Answers

I think your second approach is the way to go, and you can do that with data.table or dplyr.

Here a few steps using data.table. First, if your data frames are abc, def, ... do:

DF <- do.call(rbind, list(abc,def,...))

now you can transform them into a data.table

DT <- data.table(DF)

and simply do something like

DTres <- DT[,.(A=sum(A, na.rm=T), B=sum(B, na.rm=T), C=mean(C,na.rm=T)),by=name]

double check the data.table vignettes to get a better idea how that package work.

like image 81
Carlos Alberto Avatar answered Sep 28 '22 02:09

Carlos Alberto


We can use dplyr

library(dplyr)
bind_rows(abc, def, ...) %>%
   group_by(name) %>%
   summarise(A= sum(A, na.rm= TRUE),
            B = sum(B, na.rm= TRUE),
            C = mean(C, na.rm=TRUE))
like image 36
akrun Avatar answered Sep 28 '22 02:09

akrun