I have a list of data frames in R
Each data frame's rownames are dates and they have a column named number.
I want to create a single data frame from the list such that the columnames are the same as list element names
Eg, if
df1 <- data.frame( number = c(10, 20, 30))
rownames(df1) <- c("202201", "202202", "202203")
df2 <- data.frame( number = c(20, 30, 30))
rownames(df2) <- c("202201", "202202", "202203")
df <- list(A = df1, B = df2)
I want to combine df such that result is
| A | B |
202201 | 10 | 20 |
202202 | 20 | 30 |
202203 | 30 | 30 |
The actual data contains multiple data frames and the range of date, i.e the rownames may also differ.
I would prefer a full join
I tried using the following method
merged_df <- Reduce(function(x, y) merge(x, y, by = 0, all = TRUE), df)
I got the warning
"In merge.data.frame() column names are dupicated in the result "
The output contains multiple columns with row.names and number.x , number.y
You're on the right start:
# unchanged
out <- Reduce(function(x, y) merge(x, y, by = 0, all = TRUE), df)
# remove `Row.names` column
rownames(out) <- out[[1]]
# change all remaining names
out <- setNames(out[,-1], names(df))
out
# A B
# 202201 10 20
# 202202 20 30
# 202203 30 30
But I don't get your error, even without the new code. I suspect that having more in the df list is causing some problem.
df3 <- data.frame(number = c(20, 40, 60))
rownames(df3) <- c("202201", "202202", "202204")
df <- list(A = df1, B = df2, C = df3)
out <- Reduce(function(x, y) merge(x, y, by = 0, all = TRUE), df)
# Warning in merge.data.frame(x, y, by = 0, all = TRUE) :
# column name ‘Row.names’ is duplicated in the result
This is because the first merge results in a frame without row names, so therefore the magic of by=0 (or by="row.names") can no longer work.
merge(df1, df2, by=0, all=T)
# Row.names number.x number.y
# 1 202201 10 20
# 2 202202 20 30
# 3 202203 30 30
#### note: no row names, additional column
We have two options:
Pre-shift all frames to have the row-names as a real column. (Personally I prefer the first step here, since I don't trust row-names and either ignore or actively remove them from my data. dplyr and similar packages often removes them as well, fyi, in case you're thinking of using anything from the tidyverse. But I'll continue without that preference.)
out <- lapply(df, function(obj) transform(obj, rowname = rownames(obj))) |>
Reduce(f = function(x, y) merge(x, y, by = "rowname", all = TRUE))
out
# rowname number.x number.y number
# 1 202201 10 20 20
# 2 202202 20 30 40
# 3 202203 30 30 NA
# 4 202204 NA NA 60
which looks more reasonable. From here, the same cleanup as above will restore the row-names as you want.
rownames(out) <- out[[1]]
out <- setNames(out[,-1], names(df))
out
# A B C
# 202201 10 20 20
# 202202 20 30 40
# 202203 30 30 NA
# 202204 NA NA 60
Shift the new Row.names column back to be row-names, internally.
out <- Reduce(function(x, y) {
out <- merge(x, y, by = 0, all = TRUE)
rownames(out) <- out$Row.names
out[,-1]
}, df)
names(out) <- names(df)
out
# A B C
# 202201 10 20 20
# 202202 20 30 40
# 202203 30 30 NA
# 202204 NA NA 60
Using base, I'd name the columns first using a simple loop, and then call merge in a Reduce().
However, when you merge based on row names, merge turns the row names into a column. Because of this, the merging 2 data frames by row names works just fine, but if we try to add a 3rd there is an issue because the result of the first merge doesn't have row names any more. To solve this, we'll convert the row names to a column in all the data frames before merging, then convert back at the end.
## make the example a tiny bit more complicated
## to make sure the row names are not an issue
## (thanks to @r2evans)
df3 <- data.frame(number = c(20, 40, 60))
rownames(df3) <- c("202201", "202202", "202204")
df <- list(A = df1, B = df2, C = df3)
# make the column names same as list names
# and add row names as an actual column
for(i in seq_along(df)) {
names(df[[i]]) = names(df)[i]
df[[i]]$merge_id = row.names(df[[i]])
}
# merge by our ID
result = Reduce(\(x, y) merge(x, y, by = "merge_id", all = TRUE), df)
# convert back to row_names
row.names(result) = result$merge_id
result$merge_id = NULL
result
# A B C
# 202201 10 20 20
# 202202 20 30 40
# 202203 30 30 NA
# 202204 NA NA 60
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With