My problem is very similar to the one posted here.
The difference is that they knew which columns would conflict, whereas I need a generic method that won't know in advance which columns conflict.
Example:
TABLE1
Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5
TABLE2
Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2
Table 2 only has dates, so each of its rows applies to every row in Table 1 that matches the date, regardless of time.
I would like the merge to sum each pair of conflicting columns into one. The result should look like this:
TABLE3
Date Time ColumnA ColumnB ColumnC
01/01/2013 08:00 110 330 1
01/01/2013 08:30 115 325 1
01/01/2013 09:00 120 320 1
02/01/2013 08:00 225 415 2
02/01/2013 08:30 230 410 2
02/01/2013 09:00 235 405 2
At the moment my standard merge just creates duplicate columns of "ColumnA.x", "ColumnA.y", "ColumnB.x", "ColumnB.y".
Any help is much appreciated.
The merge() function in base R merges two data frames by common columns or row names. By default it keeps only the rows whose key values appear in both data frames, which is an inner join; the all, all.x, and all.y arguments give outer, left, and right joins. Columns from the first data frame come first in the result, followed by those from the second.
merge() is part of base R, so no package is needed; the dplyr package provides its own family of join functions (inner_join(), left_join(), and so on). The key columns should have the same type in both data frames. merge() works much like a join in a DBMS. When non-key columns share a name, merge() keeps both and appends the suffixes .x and .y.
To merge more than two data frames, put them in a list and apply merge() repeatedly, for example with Reduce(merge, list_of_dfs).
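As a rough base R sketch of how merge() could be combined with a generic conflict check: the conflicting columns are simply the names shared by both tables other than the key, so they can be found with intersect() and summed after the merge. The names by_cols, conflicts, and merged are illustrative only, and the tables are rebuilt from the example data in the question. This is one possible approach under those assumptions, not the only way to do it.

```r
table1 <- data.frame(
  Date = rep(c("01/01/2013", "02/01/2013"), each = 3),
  Time = rep(c("08:00", "08:30", "09:00"), times = 2),
  ColumnA = c(10, 15, 20, 25, 30, 35),
  ColumnB = c(30, 25, 20, 15, 10, 5),
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  Date = c("01/01/2013", "02/01/2013"),
  ColumnA = c(100, 200),
  ColumnB = c(300, 400),
  ColumnC = c(1, 2),
  stringsAsFactors = FALSE
)

by_cols <- "Date"
# Columns present in both tables, other than the key, are the ones that conflict.
conflicts <- setdiff(intersect(names(table1), names(table2)), by_cols)

# all.x = TRUE keeps every row of table1, matched or not (a left join).
merged <- merge(table1, table2, by = by_cols, all.x = TRUE)

for (col in conflicts) {
  # merge() renamed each conflicting column to <name>.x and <name>.y.
  parts <- merged[, paste0(col, c(".x", ".y")), drop = FALSE]
  merged[[col]] <- rowSums(parts, na.rm = TRUE)
  merged[paste0(col, c(".x", ".y"))] <- NULL
}

# Restore a predictable row and column order.
merged <- merged[order(merged$Date, merged$Time), ]
merged <- merged[, union(names(table1), names(table2))]
print(merged)
```

Because na.rm = TRUE is used, a row of table1 with no match in table2 keeps its own value in the summed columns.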
If I understand correctly, you want a flexible method that does not require knowing which columns exist in each table aside from the columns you want to merge by and the columns you want to preserve. This may not be the most elegant solution, but here is an example function to suit your exact needs:
merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
  # Build an empty data frame with one column per unique column name
  merged_Columns <- unique(c(names(.df1), names(.df2)))
  merged_df1 <- data.frame(matrix(nrow=nrow(.df1), ncol=length(merged_Columns)))
  names(merged_df1) <- merged_Columns
  for (column in merged_Columns){
    if (column %in% .id_Columns || !column %in% names(.df2)){
      # Id columns and columns only in .df1: copy as-is
      merged_df1[, column] <- .df1[, column]
    } else if (!column %in% names(.df1)){
      # Columns only in .df2: look up the matching row
      merged_df1[, column] <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
    } else {
      # Columns in both tables: sum, treating unmatched rows as 0
      df1_Values <- .df1[, column]
      df2_Values <- .df2[match(.df1[, .match_Columns], .df2[, .match_Columns]), column]
      df2_Values[is.na(df2_Values)] <- 0
      merged_df1[, column] <- df1_Values + df2_Values
    }
  }
  return(merged_df1)
}
This function assumes you have a table '.df1' that is a master of sorts, and you want to merge data from a second table '.df2' whose rows match one or more of the rows in '.df1'. The columns to preserve from the master table '.df1' are passed as a character vector '.id_Columns', and the columns that provide the match for merging the two tables are passed as a character vector '.match_Columns'.
For your example, it would work like this:
merge_Sum(table1, table2, c("Date","Time"), "Date")
# Date Time ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00 110 330 1
# 2 01/01/2013 08:30 115 325 1
# 3 01/01/2013 09:00 120 320 1
# 4 02/01/2013 08:00 225 415 2
# 5 02/01/2013 08:30 230 410 2
# 6 02/01/2013 09:00 235 405 2
In plain language, this function first collects the unique column names across both tables and makes an empty data frame with the same number of rows as the master table '.df1' to hold the merged data. Then, for the '.id_Columns', the data is copied from '.df1' into the new merged data frame. For the other columns, any data that exists in '.df1' is added to any existing data in '.df2', where the rows in '.df2' are matched based on the '.match_Columns'.
There is probably some package out there that does something similar, but most of them require knowledge of all the existing columns and how to treat them. As I said before, this is not the most elegant solution, but it is flexible and accurate.
Update: The original function assumed a many-to-one relationship between table1 and table2, and the OP asked for a many-to-none relationship to be allowed as well. The code has been updated with slightly less efficient but more flexible logic.
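The many-to-none case comes down to how match() behaves for rows without a partner. Here is a small sketch with made-up data (master_dates, master_values, and lookup are hypothetical names, and the date 03/01/2013 is added only to show an unmatched row):

```r
master_dates  <- c("01/01/2013", "01/01/2013", "02/01/2013", "03/01/2013")
master_values <- c(10, 15, 25, 40)

lookup <- data.frame(
  Date = c("01/01/2013", "02/01/2013"),
  ColumnA = c(100, 200),
  stringsAsFactors = FALSE
)

# For each master row, match() gives the index of the matching lookup row,
# or NA when there is no match.
idx <- match(master_dates, lookup$Date)

# Indexing with NA yields NA, which is replaced by 0 before summing.
lookup_values <- lookup$ColumnA[idx]
lookup_values[is.na(lookup_values)] <- 0

summed <- master_values + lookup_values
print(summed)
```

The last row has no match in lookup, so its value is carried over unchanged instead of becoming NA.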
A data.table solution:
library(data.table)

dt1 <- data.table(read.table(header=TRUE, text="Date Time ColumnA ColumnB
01/01/2013 08:00 10 30
01/01/2013 08:30 15 25
01/01/2013 09:00 20 20
02/01/2013 08:00 25 15
02/01/2013 08:30 30 10
02/01/2013 09:00 35 5"))
dt2 <- data.table(read.table(header=TRUE, text="Date ColumnA ColumnB ColumnC
01/01/2013 100 300 1
02/01/2013 200 400 2"))
setkey(dt1, "Date")
setkey(dt2, "Date")
# Note: the ColumnC assignment has to come before the summing operations,
# otherwise it throws an error (see below).
dt1[dt2, `:=`(ColumnC = i.ColumnC, ColumnA = ColumnA + i.ColumnA,
              ColumnB = ColumnB + i.ColumnB)]
# Date Time ColumnA ColumnB ColumnC
# 1: 01/01/2013 08:00 110 330 1
# 2: 01/01/2013 08:30 115 325 1
# 3: 01/01/2013 09:00 120 320 1
# 4: 02/01/2013 08:00 225 415 2
# 5: 02/01/2013 08:30 230 410 2
# 6: 02/01/2013 09:00 235 405 2
I'm not sure why placing the ColumnC assignment at the right end throws this error. Perhaps MatthewDowle could explain the cause for this error.
dt1[dt2, `:=`(ColumnA = ColumnA + i.ColumnA, ColumnB = ColumnB + i.ColumnB,
ColumnC = i.ColumnC)]
Error in `[.data.table`(dt1, dt2, `:=`(ColumnA = ColumnA + i.ColumnA, :
Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'NULL'
Update from v1.8.9:
o Mixing adding new with updating existing columns into one `:=`() by group; i.e., DT[, `:=`(existingCol=..., newCol=...), by=...] now works without error or segfault, #2778 and #2528. Many thanks to Arun for reporting both with reproducible examples. Tests added.
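The join above names ColumnA, ColumnB, and ColumnC explicitly, whereas the question asks for a method that does not know the conflicting columns in advance. One possible generic variant, sketched here under the assumption of a single key column, works the column names out with intersect() and setdiff() and updates dt1 by reference with data.table's set(). This swaps the join syntax for match() plus set(); key_col, shared, and extra are illustrative names.

```r
library(data.table)

dt1 <- data.table(
  Date = rep(c("01/01/2013", "02/01/2013"), each = 3),
  Time = rep(c("08:00", "08:30", "09:00"), times = 2),
  ColumnA = c(10, 15, 20, 25, 30, 35),
  ColumnB = c(30, 25, 20, 15, 10, 5)
)
dt2 <- data.table(
  Date = c("01/01/2013", "02/01/2013"),
  ColumnA = c(100, 200),
  ColumnB = c(300, 400),
  ColumnC = c(1, 2)
)

key_col <- "Date"
# Columns in both tables (to be summed) and columns only in dt2 (to be added).
shared <- setdiff(intersect(names(dt1), names(dt2)), key_col)
extra  <- setdiff(names(dt2), names(dt1))

# Row of dt2 matching each row of dt1 (NA where there is none).
idx <- match(dt1[[key_col]], dt2[[key_col]])

for (col in shared) {
  add <- dt2[[col]][idx]
  add[is.na(add)] <- 0          # unmatched rows contribute nothing
  set(dt1, j = col, value = dt1[[col]] + add)
}
for (col in extra) {
  set(dt1, j = col, value = dt2[[col]][idx])
}
print(dt1)
```

Because set() modifies dt1 in place, no copy of the table is made; rows of dt1 without a match keep their own values in the shared columns and get NA in the new ones.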