Merge multiple data tables with duplicate column names

I am trying to merge (join) multiple data tables (read with fread from 5 CSV files) into a single data table. I get an error when I try to merge 5 data tables, but it works fine when I merge only 4. MWE below:

library(data.table)

# example data
DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

# this gives an error
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Error in merge.data.table(..., all = TRUE, by = "x") : x has some duplicated column name(s): y.x,y.y. Please remove or rename the duplicate(s) and try again.

# whereas this works fine
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4))

    x y.x y.y y.x y.y 
 1: a  10  11  12  13 
 2: b  11  12  13  14 
 3: c  12  13  14  15 
 4: d  13  14  15  16 
 5: e  14  15  16  17 
 6: f  15  16  17  18

I have a workaround: if I change the name of the 2nd column of DT1, the merge works:

setnames(DT1, "y", "new_y")

# this works now
Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

Why does this happen, and is there any way to merge an arbitrary number of data tables with the same column names without changing any of the column names?

srao asked Sep 11 '15 15:09


2 Answers

Here's a way of keeping a counter within Reduce, if you want to rename during the merge:

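Reduce((function() {
  counter <- 0
  function(x, y) {
    counter <<- counter + 1                  # number of merges done so far
    d <- merge(x, y, all = TRUE, by = "x")
    # rename only the last column (the y just merged in) to y.<counter>
    setnames(d, c(head(names(d), -1), paste0("y.", counter)))
  }
})(), list(DT1, DT2, DT3, DT4, DT5))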
#   x y.x y.1 y.2 y.3 y.4
#1: a  10  11  12  13  14
#2: b  11  12  13  14  15
#3: c  12  13  14  15  16
#4: d  13  14  15  16  17
#5: e  14  15  16  17  18
#6: f  15  16  17  18  19
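
As for why the original call fails: the error message and the 4-table output in the question suggest that the default suffixes (".x", ".y") get reused. The sketch below traces the column names one merge at a time, using the question's data. The intermediate names m12, m123 and m1234 are only illustrative, and suppressWarnings is there because some data.table versions warn about duplicated names in the result.

```r
library(data.table)

DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)

# 1st merge: both sides have a column y, so the suffixes are applied
m12 <- merge(DT1, DT2, by = "x", all = TRUE)
names(m12)    # "x" "y.x" "y.y"

# 2nd merge: the left side no longer has a plain y, so DT3's y keeps its name
m123 <- merge(m12, DT3, by = "x", all = TRUE)
names(m123)   # "x" "y.x" "y.y" "y"

# 3rd merge: y clashes again, and the same suffixes are applied a second time
m1234 <- suppressWarnings(merge(m123, DT4, by = "x", all = TRUE))
names(m1234)  # "x" "y.x" "y.y" "y.x" "y.y"

# A 4th merge would get m1234 as its x argument, which now has duplicated names
anyDuplicated(names(m1234)) > 0
```

So the failure only shows up at the fifth table because that is the first merge whose left-hand input already contains duplicated column names.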
eddi answered Oct 12 '22 12:10


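If it's just those 5 data tables (where x is the same for all of them), you could also use nested joins. Note that this uses the data shown at the end of this answer, where the y columns have distinct names (y1 to y5):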

# set the key for each datatable to 'x'
setkey(DT1,x)
setkey(DT2,x)
setkey(DT3,x)
setkey(DT4,x)
setkey(DT5,x)

# the nested join
mergedDT1 <- DT1[DT2[DT3[DT4[DT5]]]]

Or as @Frank said in the comments:

DTlist <- list(DT1,DT2,DT3,DT4,DT5)
Reduce(function(X,Y) X[Y], DTlist)

which gives:

   x y1 y2 y3 y4 y5
1: a 10 11 12 13 14
2: b 11 12 13 14 15
3: c 12 13 14 15 16
4: d 13 14 15 16 17
5: e 14 15 16 17 18
6: f 15 16 17 18 19

This gives the same result as:

mergedDT2 <- Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5))

> identical(mergedDT1,mergedDT2)
[1] TRUE

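When your x columns do not have the same values, a nested join will not give the desired result, because X[Y] keeps only the rows of Y, so the final result contains only the x values of the innermost table (here DT6):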

DT1[DT2[DT3[DT4[DT5[DT6]]]]]

this gives:

   x y1 y2 y3 y4 y5 y6
1: b 11 12 13 14 15 15
2: c 12 13 14 15 16 16
3: d 13 14 15 16 17 17
4: e 14 15 16 17 18 18
5: f 15 16 17 18 19 19
6: g NA NA NA NA NA 20

While:

Reduce(function(...) merge(..., all = TRUE, by = "x"), list(DT1, DT2, DT3, DT4, DT5, DT6))

gives:

   x y1 y2 y3 y4 y5 y6
1: a 10 11 12 13 14 NA
2: b 11 12 13 14 15 15
3: c 12 13 14 15 16 16
4: d 13 14 15 16 17 17
5: e 14 15 16 17 18 18
6: f 15 16 17 18 19 19
7: g NA NA NA NA NA 20
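
A different route to the same kind of full outer result, not taken from the answers here, is to stack the tables in long format and reshape to wide. This is only a sketch: it assumes every table has the same two columns x and y, and the names tables, src, long and wide are hypothetical. It uses a smaller example with three tables, the last one having shifted x values like DT6.

```r
library(data.table)

tables <- list(
  data.table(x = letters[1:6], y = 10:15),
  data.table(x = letters[1:6], y = 11:16),
  data.table(x = letters[2:7], y = 15:20)   # x values shifted, like DT6
)

# Stack all tables; src records which list element each row came from (1, 2, 3)
long <- rbindlist(tables, idcol = "src")

# One row per x, one column per source table; missing combinations become NA
wide <- dcast(long, x ~ src, value.var = "y")

# Give the value columns readable names: y1, y2, y3
setnames(wide, c("x", paste0("y", seq_along(tables))))
wide
```

Because the tables are never merged pairwise, the shared column name y does not cause a clash, and x values that are missing from some tables are filled with NA.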

Used data:

In order to make the code with Reduce work, I changed the names of the y columns.

DT1 <- data.table(x = letters[1:6], y1 = 10:15)
DT2 <- data.table(x = letters[1:6], y2 = 11:16)
DT3 <- data.table(x = letters[1:6], y3 = 12:17)
DT4 <- data.table(x = letters[1:6], y4 = 13:18)
DT5 <- data.table(x = letters[1:6], y5 = 14:19)

DT6 <- data.table(x = letters[2:7], y6 = 15:20, key="x")
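
The renaming above can also be done programmatically for an arbitrary number of tables. The following is a minimal sketch, assuming each table has the columns x and y as in the question; the names tables, renamed and merged are hypothetical. It renames copies, so the original tables keep their column names.

```r
library(data.table)

DT1 <- data.table(x = letters[1:6], y = 10:15)
DT2 <- data.table(x = letters[1:6], y = 11:16)
DT3 <- data.table(x = letters[1:6], y = 12:17)
DT4 <- data.table(x = letters[1:6], y = 13:18)
DT5 <- data.table(x = letters[1:6], y = 14:19)

tables <- list(DT1, DT2, DT3, DT4, DT5)

# Rename y to y1, y2, ... on a copy of each table (setnames works by reference)
renamed <- lapply(seq_along(tables), function(i) {
  d <- copy(tables[[i]])
  setnames(d, "y", paste0("y", i))
  d
})

# With unique column names, no suffixes are needed and the merge cannot clash
merged <- Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), renamed)
names(merged)
```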
Jaap answered Oct 12 '22 11:10