I am trying to merge (join) multiple data tables (read with fread from 5 CSV files) into a single data table. I get an error when I merge 5 data tables, but it works fine when I merge only 4. MWE below:
# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)
# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.
# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))
x y.x y.y y.x y.y
1: a 10 11 12 13
2: b 11 12 13 14
3: c 12 13 14 15
4: d 13 14 15 16
5: e 14 15 16 17
6: f 15 16 17 18
I have a workaround: if I rename the second column of DT1, the merge of all 5 tables works:
setnames(DT1, "y", "new_y")
# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?
Here's a way of keeping a counter within Reduce, if you want to rename during the merge:
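Reduce((function() {
  counter <- 0  # number of merges done so far, kept in the closure
  function(x, y) {
    counter <<- counter + 1
    d <- merge(x, y, all = TRUE, by = 'x')
    # rename the last (newly merged) column so its name stays unique
    setnames(d, c(head(names(d), -1), paste0('y.', counter)))
  }
})(), list(DT1, DT2, DT3, DT4, DT5))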
# x y.x y.1 y.2 y.3 y.4
#1: a 10 11 12 13 14
#2: b 11 12 13 14 15
#3: c 12 13 14 15 16
#4: d 13 14 15 16 17
#5: e 14 15 16 17 18
#6: f 15 16 17 18 19
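To see why the counter is needed, it can help to trace the column names through the first two merges. The sketch below is only an illustration: the names step1 and step2 are made up for it, and it rebuilds three of the example tables so that it runs on its own.

```r
library(data.table)

# same example tables as in the question
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)

# first merge: both inputs have a column "y", so merge() appends
# its default suffixes ".x" and ".y" to tell them apart
step1 <- merge(DT1, DT2, all = TRUE, by = "x")
names(step1)  # "x" "y.x" "y.y"

# second merge: "y" no longer exists on the left side,
# so DT3's column keeps its plain name
step2 <- merge(step1, DT3, all = TRUE, by = "x")
names(step2)  # "x" "y.x" "y.y" "y"
```

A further merge with another table that has a y column makes y clash again, so the same two suffixes get appended a second time. That appears to be where the repeated y.x and y.y names in the question come from, and renaming the new column after every merge avoids it.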
If it's just those 5 datatables (where x is the same for all datatables), you could also use nested joins:
# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)
# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]
Or as @Frank said in the comments:
DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)
which gives:
x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19
This gives the same result as:
mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))
> identical(mergedDT1,mergedDT2)
[1] TRUE
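When your x columns do not have the same values, a nested join will not give the desired solution, because X[Y] keeps only the rows of Y, so a value that is missing from the innermost table (here a) is dropped: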
DT1[DT2[DT3[DT4[DT5[DT6]]]]]
this gives:
x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20
While:
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))
gives:
x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20
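If the tables all keep the same column name y, one possible alternative to chaining merges is to stack them and reshape to wide format. This is only a sketch under assumptions: the list names y1, y2, y6 and the id column name src are chosen for illustration, and only three small tables are used to keep it short.

```r
library(data.table)

# tables that all use the same column names
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT6 <- data.table(x = letters[2:7], y = 15:20)

# the list names become the column names of the result
tables <- list(y1 = DT1, y2 = DT2, y6 = DT6)

# stack all tables; "src" records which table each row came from
long <- rbindlist(tables, idcol = "src")

# one row per x, one column per source table; missing pairs become NA
wide <- dcast(long, x ~ src, value.var = "y")
wide
```

Like merge with all = TRUE, this keeps every value of x that occurs in any table, and it does not depend on how many tables are in the list.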
Used data:
In order to make the code with Reduce work, I changed the names of the y columns.
DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)
DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")
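If the tables arrive with identical column names (as in the question), the renaming could also be done programmatically before merging rather than by hand. A minimal sketch, assuming the shared value column is called y; the names tables, renamed and merged are made up for this example, and copy() is used so the original tables are left untouched.

```r
library(data.table)

# tables as in the question, all with a column named "y"
tables <- list(
  data.table(x = letters[1:6], y = 10:15),
  data.table(x = letters[1:6], y = 11:16),
  data.table(x = letters[1:6], y = 12:17),
  data.table(x = letters[1:6], y = 13:18),
  data.table(x = letters[1:6], y = 14:19)
)

# give each table's y column a unique name: y1, y2, ...
renamed <- lapply(seq_along(tables), function(i) {
  d <- copy(tables[[i]])           # avoid modifying the input by reference
  setnames(d, "y", paste0("y", i))
  d
})

# with unique names, merge() never needs to add suffixes
merged <- Reduce(function(a, b) merge(a, b, all = TRUE, by = "x"), renamed)
names(merged)  # "x" "y1" "y2" "y3" "y4" "y5"
```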