Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add (not merge!) two data frames with unequal rows and columns

Tags:

dataframe

r

I want to efficiently sum the entries of two data frames, though the data frames are not guaranteed to have the same dimensions or column names. Merge isn't really what I'm after here. Instead I want to create an output object with all of the row and column names that belong to either of the added data frames. In each position of that output, I want to use the following logic for the computed value:

  • If a row/column pairing belongs to both input data frames I want the output to include their sum
  • If a row/column pairing belongs to just one input data frame I want to include that value in the output
  • If a row/column pairing does not belong to any input matrix I want to have 0 in that position in the output.

As an example, consider the following input data frames:

df1 = data.frame(x = c(1,2,3), y = c(4,5,6))
rownames(df1) = c("a", "b", "c")
df2 = data.frame(x = c(7,8), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
> df1
  x y
a 1 4
b 2 5
c 3 6
> df2
  x  z  w 
a 7  9  2
d 8 10  3

I want the final result to be

> df2
   x  y   z  w
a  8  4   9  2
b  2  5   0  0
c  3  6   0  0
d  8  0  10  3

What I've done so far -

bind_rows / bind_cols in dplyr can throw the following: "Error: incompatible number of rows (3, expecting 2)"

I have duplicated column names, so 'merge' isn't working for my purposes either - returns an empty df for some reason.

like image 967
Jeff Shane Avatar asked Feb 02 '16 20:02

Jeff Shane


People also ask

How do I combine unequal data frames in R?

Use the left_join Function to Merge Two R Data Frames With Different Number of Rows. left_join is another method from the dplyr package. It takes arguments similar to the full_join function, but left_join extracts all rows from the first data frame and all columns from both of them.

How do I merge DataFrames with different columns in R?

The bind_rows() method is used to combine data frames with different columns. The column names are number may be different in the input data frames. Missing columns of the corresponding data frames are filled with NA. The output data frame contains a column only if it is present in any of the data frame.

Which function is used to merge two data frames?

join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on what data must be in the final DataFrame.

How do I merge two data tables in R?

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either: Delete the extra variables in data frameA or.


3 Answers

Seems like you could merge on the rownames, then take care of the sums and conversion of NA to zero with some additional munging:

library(dplyr)

df.new = df1 %>% add_rownames %>%
  full_join(df2 %>% add_rownames, by="rowname") %>%
  mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
  mutate(x = x.x + x.y) %>%
  select(rowname,x,y,z,w)

Or, with @DavidArenburg's much more elegant and extensible solution:

df.new = df1 %>% add_rownames %>% 
  full_join(df2 %>% add_rownames) %>% 
  group_by(rowname) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

df.new

  rowname     x     y     z     w
1       a     8     4     9     2
2       b     2     5     0     0
3       c     3     6     0     0
4       d     8     0    10     3
like image 190
eipi10 Avatar answered Oct 10 '22 09:10

eipi10


This seems like some type of a simple merge on common column names (+ row names) and then a simple aggregation, this is how I would tackle this

library(data.table)
merge(setDT(df1, keep.rownames = TRUE), # Convert to data.table + keep rows
      setDT(df2, keep.rownames = TRUE), # Convert to data.table + keep rows
      by = intersect(names(df1), names(df2)), # merge on common column names
      all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn] # Sum all columns by group                   
#    rn x y  z w
# 1:  a 8 4  9 2
# 2:  b 2 5  0 0
# 3:  c 3 6  0 0
# 4:  d 8 0 10 3

Are a pretty straight forward base R solution

df1$rn <- row.names(df1)
df2$rn <- row.names(df2)
res <- merge(df1, df2, all = TRUE)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
#   x y  z w
# a 8 4  9 2
# b 2 5  0 0
# c 3 6  0 0
# d 8 0 10 3
like image 35
David Arenburg Avatar answered Oct 10 '22 07:10

David Arenburg


First, I would grab the names of all the rows and columns of the new entity:

(all.rows <- unique(c(row.names(df1), row.names(df2))))
# [1] "a" "b" "c" "d"
(all.cols <- unique(c(names(df1), names(df2))))
# [1] "x" "y" "z" "w"

Then I would construct an output matrix with those rows and column names (with matrix data initialized to all 0s), adding df1 and df2 to the relevant parts of that matrix.

out <- matrix(0, nrow=length(all.rows), ncol=length(all.cols))
rownames(out) <- all.rows
colnames(out) <- all.cols
out[row.names(df1),names(df1)] <- unlist(df1)
out[row.names(df2),names(df2)] <- out[row.names(df2),names(df2)] + unlist(df2)
out
#   x y  z w
# a 8 4  9 2
# b 2 5  0 0
# c 3 6  0 0
# d 8 0 10 3
like image 2
josliber Avatar answered Oct 10 '22 07:10

josliber