Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - reduce with merge and more than 2 suffixes (or: how to merge multiple dataframes and keep track of columns)

Tags:

merge

dataframe

r

I'm trying to merge 4 dataframes based on 2 columns, but keep track of which dataframe a column originated from. I'm running into an issue at tracking the columns.

(see end of post of dput(dfs))

#df example (df1)
Name    Color    Freq
banana  yellow   3
apple   red      1
apple   green    4
plum    purple   8


#create list of dataframes
list.df <- list(df1, df2, df3, df4)

#merge dfs on column "Name" and "Color"
combo.df <- Reduce(function(x,y) merge(x,y, by = c("Name", "Color"), all = TRUE, accumulate=FALSE, suffixes = c(".df1", ".df2", ".df3", ".df4")), list.df)

This gives the following warning:

Warning message: In merge.data.frame(x, y, by = c("Name", "Color"), all = TRUE, : column names ‘Freq.df1’, ‘Freq.df2’ are duplicated in the result

and outputs this dataframe:

#combo df example
Name    Color    Freq.df1   Freq.df2  Freq.df1  Freq.df2
banana  yellow   3          3         7         NA
apple   red      1          2         9         1
apple   green    4          NA        8         2
plum    purple   8          1         NA        6

df1 and df2 are only repeated in name. The values populating the third and fourth column of combo are actually from df3 and df4 respectively.

What I would really like is:

Name    Color    Freq.df1   Freq.df2  Freq.df3  Freq.df4
banana  yellow   3          3         7         NA
apple   red      1          2         9         1
apple   green    4          NA        8         2
plum    purple   8          1         NA        6

How can I achieve this? I know the merge(..., suffixes) function can only handle a character vector of 2, but I don't know what the work around should be. Thanks!

df1 <- 
structure(list(Name = structure(c(2L, 1L, 1L, 3L), .Label = c("apple", 
"banana", "plum"), class = "factor"), Color = structure(c(4L, 
3L, 1L, 2L), .Label = c("green", "purple", "red", "yellow"), class = "factor"), 
    Freq = c(3, 1, 4, 8)), .Names = c("Name", "Color", "Freq"
), row.names = c(NA, -4L), class = "data.frame")

df2 <-
structure(list(Name = structure(c(2L, 1L, 3L), .Label = c("apple", 
"banana", "plum"), class = "factor"), Color = structure(c(3L, 
2L, 1L), .Label = c("purple", "red", "yellow"), class = "factor"), 
    Freq = c(3, 2, 1)), .Names = c("Name", "Color", "Freq"), row.names = c(NA, 
-3L), class = "data.frame")

df3 <-
structure(list(Name = structure(c(2L, 1L, 1L), .Label = c("apple", 
"banana"), class = "factor"), Color = structure(c(3L, 2L, 1L), .Label = c("green", 
"red", "yellow"), class = "factor"), Freq = c(7, 9, 8)), .Names = c("Name", 
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")

df4 <-
structure(list(Name = structure(c(1L, 1L, 2L), .Label = c("apple", 
"plum"), class = "factor"), Color = structure(c(3L, 1L, 2L), .Label = c("green", 
"purple", "red"), class = "factor"), Freq = c(1, 2, 6)), .Names = c("Name", 
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")
like image 925
moxed Avatar asked Oct 15 '25 16:10

moxed


2 Answers

This seems to be easier with a for loop as the Reduce or reduce (purrr) at a time takes only two datasets, so we can't have more than two suffixes in the merge.

Here, we created a vector of suffixes ('sfx'). Initialize an output dataset with the first list element. Then loop through the sequence of 'list.df' and do a sequential merge with the 'res' and the next element of list.df while updating the 'res' in each step

sfx <- c(".df1", ".df2", ".df3", ".df4")
res <- list.df[[1]]
for(i in head(seq_along(list.df), -1)) {

 res <- merge(res, list.df[[i+1]], all = TRUE, 
                 suffixes = sfx[i:(i+1)], by = c("Name", "Color"))
  }

res
#    Name  Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1  apple  green        4       NA        8        2
#2  apple    red        1        2        9        1
#3 banana yellow        3        3        7       NA
#4   plum purple        8        1       NA        6
like image 79
akrun Avatar answered Oct 18 '25 06:10

akrun


I finally could make this one work using Reduce function itself. To do so I modified the input in a particular format.

As we could not pass the names of the data.frame as parameter inside the Reduce function, I created a list with an attribute n containing the name of the data.frame.

lst=list(list(n="df1",df=df1),list(n="df2",df=df2),list(n="df3",df=df3), list(n="df4",df=df4))

Around that I have built the logic to track the name of the data.frames being processed.

Reduce(function(x,y){
    if(ncol(x$df)==3){
      #df column names after 1st merge.
      namecol=c('Name','Color',paste0("Freq.",x$n),paste0("Freq.",y$n))
    }else{
        #df column names for remaining merges.
        namecol=c(colnames(x$df),paste0("Freq.",y$n))
    }
    df=merge.data.frame(x = x$df,y = y$df,by = c("Name","Color"),all = TRUE)
    colnames(df)=namecol
    list(n="df",df=df)},lst)


#$n
#[1] "df"

#$df
#    Name  Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1  apple  green        4       NA        8        2
#2  apple    red        1        2        9        1
#3 banana yellow        3        3        7       NA
#4   plum purple        8        1       NA        6
like image 44
tushaR Avatar answered Oct 18 '25 08:10

tushaR



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!