I'm trying to merge 4 dataframes based on 2 columns, but keep track of which dataframe a column originated from. I'm running into an issue at tracking the columns.
(see end of post of dput(dfs))
#df example (df1)
Name Color Freq
banana yellow 3
apple red 1
apple green 4
plum purple 8
#create list of dataframes
list.df <- list(df1, df2, df3, df4)
#merge dfs on column "Name" and "Color"
combo.df <- Reduce(function(x,y) merge(x,y, by = c("Name", "Color"), all = TRUE, accumulate=FALSE, suffixes = c(".df1", ".df2", ".df3", ".df4")), list.df)
This gives the following warning:
Warning message: In merge.data.frame(x, y, by = c("Name", "Color"), all = TRUE, : column names ‘Freq.df1’, ‘Freq.df2’ are duplicated in the result
and outputs this dataframe:
#combo df example
Name Color Freq.df1 Freq.df2 Freq.df1 Freq.df2
banana yellow 3 3 7 NA
apple red 1 2 9 1
apple green 4 NA 8 2
plum purple 8 1 NA 6
df1
and df2
are only repeated in name. The values populating the third and fourth column of combo
are actually from df3
and df4
respectively.
What I would really like is:
Name Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
banana yellow 3 3 7 NA
apple red 1 2 9 1
apple green 4 NA 8 2
plum purple 8 1 NA 6
How can I achieve this? I know the merge(..., suffixes)
function can only handle a character vector of 2, but I don't know what the work around should be. Thanks!
df1 <-
structure(list(Name = structure(c(2L, 1L, 1L, 3L), .Label = c("apple",
"banana", "plum"), class = "factor"), Color = structure(c(4L,
3L, 1L, 2L), .Label = c("green", "purple", "red", "yellow"), class = "factor"),
Freq = c(3, 1, 4, 8)), .Names = c("Name", "Color", "Freq"
), row.names = c(NA, -4L), class = "data.frame")
df2 <-
structure(list(Name = structure(c(2L, 1L, 3L), .Label = c("apple",
"banana", "plum"), class = "factor"), Color = structure(c(3L,
2L, 1L), .Label = c("purple", "red", "yellow"), class = "factor"),
Freq = c(3, 2, 1)), .Names = c("Name", "Color", "Freq"), row.names = c(NA,
-3L), class = "data.frame")
df3 <-
structure(list(Name = structure(c(2L, 1L, 1L), .Label = c("apple",
"banana"), class = "factor"), Color = structure(c(3L, 2L, 1L), .Label = c("green",
"red", "yellow"), class = "factor"), Freq = c(7, 9, 8)), .Names = c("Name",
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")
df4 <-
structure(list(Name = structure(c(1L, 1L, 2L), .Label = c("apple",
"plum"), class = "factor"), Color = structure(c(3L, 1L, 2L), .Label = c("green",
"purple", "red"), class = "factor"), Freq = c(1, 2, 6)), .Names = c("Name",
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")
This seems to be easier with a for
loop as the Reduce
or reduce
(purrr
) at a time takes only two datasets, so we can't have more than two suffixes
in the merge
.
Here, we created a vector of suffixes ('sfx'). Initialize an output dataset with the first list
element. Then loop through the sequence of 'list.df' and do a sequential merge
with the 'res' and the next element of list.df
while updating the 'res' in each step
sfx <- c(".df1", ".df2", ".df3", ".df4")
res <- list.df[[1]]
for(i in head(seq_along(list.df), -1)) {
res <- merge(res, list.df[[i+1]], all = TRUE,
suffixes = sfx[i:(i+1)], by = c("Name", "Color"))
}
res
# Name Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1 apple green 4 NA 8 2
#2 apple red 1 2 9 1
#3 banana yellow 3 3 7 NA
#4 plum purple 8 1 NA 6
I finally could make this one work using Reduce
function itself. To do so I modified the input in a particular format.
As we could not pass the names of the data.frame
as parameter inside the Reduce
function, I created a list with an attribute n
containing the name of the data.frame.
lst=list(list(n="df1",df=df1),list(n="df2",df=df2),list(n="df3",df=df3), list(n="df4",df=df4))
Around that I have built the logic to track the name of the data.frames
being processed.
Reduce(function(x,y){
if(ncol(x$df)==3){
#df column names after 1st merge.
namecol=c('Name','Color',paste0("Freq.",x$n),paste0("Freq.",y$n))
}else{
#df column names for remaining merges.
namecol=c(colnames(x$df),paste0("Freq.",y$n))
}
df=merge.data.frame(x = x$df,y = y$df,by = c("Name","Color"),all = TRUE)
colnames(df)=namecol
list(n="df",df=df)},lst)
#$n
#[1] "df"
#$df
# Name Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1 apple green 4 NA 8 2
#2 apple red 1 2 9 1
#3 banana yellow 3 3 7 NA
#4 plum purple 8 1 NA 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With