Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combine list of data frames in R

I have a list of data frames in R

Each data frame's rownames are dates and they have a column named number.

I want to create a single data frame from the list such that the columnames are the same as list element names

Eg, if

    df1 <- data.frame( number = c(10, 20, 30))
    rownames(df1) <- c("202201", "202202", "202203")

    df2 <- data.frame( number = c(20, 30, 30))
    rownames(df2) <- c("202201", "202202", "202203")

    df <- list(A = df1, B = df2)

I want to combine df such that result is

       | A        |  B       |       
  
202201 | 10       | 20       |
202202 | 20       | 30       |
202203 | 30       | 30       |

The actual data contains multiple data frames and the range of date, i.e the rownames may also differ.

I would prefer a full join

I tried using the following method

merged_df <- Reduce(function(x, y) merge(x, y, by = 0, all = TRUE), df)

I got the warning

"In merge.data.frame() column names are dupicated in the result "

The output contains multiple columns with row.names and number.x , number.y

like image 838
Siddharth Somani Avatar asked Dec 07 '25 07:12

Siddharth Somani


2 Answers

You're on the right start:

# unchanged
out <- Reduce(function(x, y) merge(x, y, by = 0, all = TRUE), df)
# remove `Row.names` column
rownames(out) <- out[[1]]
# change all remaining names
out <- setNames(out[,-1], names(df))
out
#         A  B
# 202201 10 20
# 202202 20 30
# 202203 30 30

But I don't get your error, even without the new code. I suspect that having more in the df list is causing some problem.

df3 <- data.frame(number = c(20, 40, 60))
rownames(df3) <- c("202201", "202202", "202204")
df <- list(A = df1, B = df2, C = df3)
out <- Reduce(function(x, y) merge(x, y, by = 0, all = TRUE), df)
# Warning in merge.data.frame(x, y, by = 0, all = TRUE) :
#   column name ‘Row.names’ is duplicated in the result

This is because the first merge results in a frame without row names, so therefore the magic of by=0 (or by="row.names") can no longer work.

merge(df1, df2, by=0, all=T)
#   Row.names number.x number.y
# 1    202201       10       20
# 2    202202       20       30
# 3    202203       30       30
#### note: no row names, additional column

We have two options:

  1. Pre-shift all frames to have the row-names as a real column. (Personally I prefer the first step here, since I don't trust row-names and either ignore or actively remove them from my data. dplyr and similar packages often removes them as well, fyi, in case you're thinking of using anything from the tidyverse. But I'll continue without that preference.)

    out <- lapply(df, function(obj) transform(obj, rowname = rownames(obj))) |>
      Reduce(f = function(x, y) merge(x, y, by = "rowname", all = TRUE))
    out
    #   rowname number.x number.y number
    # 1  202201       10       20     20
    # 2  202202       20       30     40
    # 3  202203       30       30     NA
    # 4  202204       NA       NA     60
    

    which looks more reasonable. From here, the same cleanup as above will restore the row-names as you want.

    rownames(out) <- out[[1]]
    out <- setNames(out[,-1], names(df))
    out
    #         A  B  C
    # 202201 10 20 20
    # 202202 20 30 40
    # 202203 30 30 NA
    # 202204 NA NA 60
    
  2. Shift the new Row.names column back to be row-names, internally.

    out <- Reduce(function(x, y) {
      out <- merge(x, y, by = 0, all = TRUE)
      rownames(out) <- out$Row.names
      out[,-1]
    }, df)
    names(out) <- names(df)
    out
    #         A  B  C
    # 202201 10 20 20
    # 202202 20 30 40
    # 202203 30 30 NA
    # 202204 NA NA 60
    
like image 196
r2evans Avatar answered Dec 08 '25 21:12

r2evans


Using base, I'd name the columns first using a simple loop, and then call merge in a Reduce().

However, when you merge based on row names, merge turns the row names into a column. Because of this, the merging 2 data frames by row names works just fine, but if we try to add a 3rd there is an issue because the result of the first merge doesn't have row names any more. To solve this, we'll convert the row names to a column in all the data frames before merging, then convert back at the end.

## make the example a tiny bit more complicated
## to make sure the row names are not an issue
## (thanks to @r2evans)
df3 <- data.frame(number = c(20, 40, 60))
rownames(df3) <- c("202201", "202202", "202204")
df <- list(A = df1, B = df2, C = df3)

# make the column names same as list names
# and add row names as an actual column
for(i in seq_along(df)) {
  names(df[[i]]) = names(df)[i]
  df[[i]]$merge_id = row.names(df[[i]])
}

# merge by our ID
result = Reduce(\(x, y) merge(x, y, by = "merge_id", all = TRUE), df)
# convert back to row_names
row.names(result) = result$merge_id
result$merge_id = NULL
result
#         A  B  C
# 202201 10 20 20
# 202202 20 30 40
# 202203 30 30 NA
# 202204 NA NA 60
like image 44
Gregor Thomas Avatar answered Dec 08 '25 21:12

Gregor Thomas