Add (not merge!) two data frames with unequal rows and columns

Tags:

dataframe

r

I want to efficiently sum the entries of two data frames, though the data frames are not guaranteed to have the same dimensions or column names. Merge isn't really what I'm after here. Instead I want to create an output object with all of the row and column names that belong to either of the added data frames. In each position of that output, I want to use the following logic for the computed value:

If a row/column pairing belongs to both input data frames I want the output to include their sum
If a row/column pairing belongs to just one input data frame I want to include that value in the output
If a row/column pairing does not belong to either input data frame I want to have 0 in that position in the output.

As an example, consider the following input data frames:

df1 = data.frame(x = c(1,2,3), y = c(4,5,6))
rownames(df1) = c("a", "b", "c")
df2 = data.frame(x = c(7,8), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
> df1
  x y
a 1 4
b 2 5
c 3 6
> df2
  x  z  w 
a 7  9  2
d 8 10  3

I want the final result to be

> result
   x  y   z  w
a  8  4   9  2
b  2  5   0  0
c  3  6   0  0
d  8  0  10  3

What I've tried so far:

bind_rows / bind_cols in dplyr can throw the following: "Error: incompatible number of rows (3, expecting 2)"

I have duplicated column names, so 'merge' isn't working for my purposes either: it returns an empty data frame for some reason.

asked Feb 02 '16 20:02

Jeff Shane

3 Answers

Seems like you could merge on the rownames, then take care of the sums and conversion of NA to zero with some additional munging:

library(dplyr)

df.new = df1 %>% add_rownames %>%
  full_join(df2 %>% add_rownames, by="rowname") %>%
  mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
  mutate(x = x.x + x.y) %>%
  select(rowname,x,y,z,w)

Or, with @DavidArenburg's much more elegant and extensible solution:

df.new = df1 %>% add_rownames %>% 
  full_join(df2 %>% add_rownames) %>% 
  group_by(rowname) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

df.new

  rowname     x     y     z     w
1       a     8     4     9     2
2       b     2     5     0     0
3       c     3     6     0     0
4       d     8     0    10     3
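
add_rownames(), mutate_each(), summarise_each() and funs() have since been deprecated in dplyr. The following is a minimal sketch with current verbs, assuming dplyr 1.0 or later and the tibble package are installed. It stacks the inputs with bind_rows() instead of joining them, which is a substitution rather than the answer's method; stacking does not depend on which columns a join would match on.

```r
library(dplyr)
library(tibble)

df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), row.names = c("a", "b", "c"))
df2 <- data.frame(x = c(7, 8), z = c(9, 10), w = c(2, 3), row.names = c("a", "d"))

df.new <- bind_rows(
  rownames_to_column(df1, "rowname"),  # keep row names as an ordinary column
  rownames_to_column(df2, "rowname")
) %>%                                  # columns missing from one input are filled with NA
  group_by(rowname) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))  # NA contributes 0 to the sum
```

The result is a tibble with one row per row name and one column per column name found in either input.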

answered Oct 10 '22 09:10

eipi10

This looks like a simple merge on the common column names (plus row names) followed by a simple aggregation. This is how I would tackle it:

library(data.table)
merge(setDT(df1, keep.rownames = TRUE), # Convert to data.table + keep rows
      setDT(df2, keep.rownames = TRUE), # Convert to data.table + keep rows
      by = intersect(names(df1), names(df2)), # merge on common column names
      all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn] # Sum all columns by group                   
#    rn x y  z w
# 1:  a 8 4  9 2
# 2:  b 2 5  0 0
# 3:  c 3 6  0 0
# 4:  d 8 0 10 3

Or a pretty straightforward base R solution

df1$rn <- row.names(df1)
df2$rn <- row.names(df2)
res <- merge(df1, df2, all = TRUE)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
#   x y  z w
# a 8 4  9 2
# b 2 5  0 0
# c 3 6  0 0
# d 8 0 10 3
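
One thing to watch with the merge-based versions: merge(df1, df2, all = TRUE) keys on every column the inputs share, here rn and x. If a shared row held the same x value in both inputs, the two rows would match and that value would be counted once rather than summed. A possible base R variant that stacks instead of merging is sketched below; pad_cols is a hypothetical helper introduced for illustration, and all columns are assumed to be numeric.

```r
df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), row.names = c("a", "b", "c"))
df2 <- data.frame(x = c(7, 8), z = c(9, 10), w = c(2, 3), row.names = c("a", "d"))

all.cols <- union(names(df1), names(df2))

# Hypothetical helper: add each missing column as 0 and use a common column order
pad_cols <- function(df, cols) {
  for (col in setdiff(cols, names(df))) df[[col]] <- 0
  df[cols]
}

stacked <- rbind(pad_cols(df1, all.cols), pad_cols(df2, all.cols))
# rbind() makes duplicated row names unique, so keep the original names separately
groups <- c(row.names(df1), row.names(df2))
res <- rowsum(stacked, group = groups)  # sum rows that share a row name
```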

answered Oct 10 '22 07:10

David Arenburg

First, I would grab the names of all the rows and columns of the new entity:

(all.rows <- unique(c(row.names(df1), row.names(df2))))
# [1] "a" "b" "c" "d"
(all.cols <- unique(c(names(df1), names(df2))))
# [1] "x" "y" "z" "w"

Then I would construct an output matrix with those row and column names (initialized to all 0s) and add df1 and df2 into the relevant parts of that matrix.

out <- matrix(0, nrow=length(all.rows), ncol=length(all.cols))
rownames(out) <- all.rows
colnames(out) <- all.cols
out[row.names(df1),names(df1)] <- unlist(df1)
out[row.names(df2),names(df2)] <- out[row.names(df2),names(df2)] + unlist(df2)
out
#   x y  z w
# a 8 4  9 2
# b 2 5  0 0
# c 3 6  0 0
# d 8 0 10 3
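
The same steps could be wrapped in a function that accepts any number of data frames. This is only a sketch: add_frames is a hypothetical name, and it assumes every column is numeric and that row names are unique within each input.

```r
# Hypothetical helper: sum data frames cell by cell, matching on row and
# column names and treating cells absent from an input as 0.
add_frames <- function(...) {
  dfs <- list(...)
  all.rows <- unique(unlist(lapply(dfs, row.names)))
  all.cols <- unique(unlist(lapply(dfs, names)))
  out <- matrix(0, nrow = length(all.rows), ncol = length(all.cols),
                dimnames = list(all.rows, all.cols))
  for (df in dfs) {
    r <- row.names(df)
    k <- names(df)
    # drop = FALSE keeps the matrix shape even for one-row or one-column inputs
    out[r, k] <- out[r, k, drop = FALSE] + as.matrix(df)
  }
  out
}

df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), row.names = c("a", "b", "c"))
df2 <- data.frame(x = c(7, 8), z = c(9, 10), w = c(2, 3), row.names = c("a", "d"))
res <- add_frames(df1, df2)
```

The result is a numeric matrix; as.data.frame(res) converts it back if a data frame is needed.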

answered Oct 10 '22 07:10

josliber
