Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.table merge produces extra columns [R]

Below I define a master dataset of dimensions 12x5. I divide it into four data.tables and I want to merge them. There is no row ID overlap between data.tables and some column name overlap. When I merge them, merge() doesn't recognize column name matches, and creates new columns for every column in each data.table. The final merged data.table should be 12x5, but it is coming out as 12x7. I thought that the all=TRUE command in data.table's merge() would solve this.

library(data.table)

a <- data.table(id = c(1, 2, 3),  C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6),  C1 = c(1, 2, 3),  C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9),  C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12),  C3 = c(8, 2, 3), C4 = c(4, 6, 8))

setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")

final <- merge(a, b,  all = TRUE)
final <- merge(final, c,  all = TRUE)
final <- merge(final, d,  all = TRUE)

names(final)
dim(final)  #outputs correct numb of rows, but too many columns
like image 962
Dr. Beeblebrox Avatar asked Aug 11 '14 13:08

Dr. Beeblebrox


People also ask

How do I stitch a table together in R?

One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2) . The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.

How do I merge columns in a table in R?

How do I concatenate two columns in R? To concatenate two columns you can use the <code>paste()</code> function. For example, if you want to combine the two columns A and B in the dataframe df you can use the following code: <code>df['AB'] <- paste(df$A, df$B)</code>.

How do I merge two data frames in the same column in R?

To combine two data frames with same columns in R language, call rbind() function, and pass the two data frames, as arguments. rbind() function returns the resulting data frame created from concatenating the given two data frames. For rbind() function to combine the given data frames, the column names must match.

How do I merge data frames in R?

In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package.


1 Answers

The problem is with the way you are using the 'merge' function. 'merge' function in data.table package by default merges two data tables by the "shared key columns between them". Suppose you create 'a' and 'b' data tables like this:

library(data.table)
a <- data.table(id = c(1, 2, 3),  C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6),  C1 = c(1, 2, 3),  C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")

where 'a' is going to be like:

   id C1
1:  1  1
2:  2  2
3:  3  3

and 'b' is going to be like:

   id C1 C2
1:  4  1  2
2:  5  2  3
3:  6  3  4

now, lets first try your code:

merge(a, b,  all = TRUE)

This is the result:

   id C1.x C1.y C2
1:  1    1   NA NA
2:  2    2   NA NA
3:  3    3   NA NA
4:  4   NA    1  2
5:  5   NA    2  3
6:  6   NA    3  4

This is due to the fact that 'merge' function is taking only 'id' field (shared key between data tables 'a' and 'b') as the merging column, while adding all non-shared columns to the resulting data table. Now lets try specifying what columns to merge on:

merge(a, b, by=c("id","C1"), all = TRUE)

now the result is going to be:

   id C1 C2
1:  1  1 NA
2:  2  2 NA
3:  3  3 NA
4:  4  1  2
5:  5  2  3
6:  6  3  4

Same applies to other merge functions you called. So try this:

final <- merge(a, b, by=c("id","C1"), all = TRUE)
final <- merge(final, c, by="id", all = TRUE)  #here you don't necessarily need to specify by...
final <- merge( final, d, by=c("id","C3"),all=TRUE)

dim(final)
[1] 12  5
like image 174
R for the Win Avatar answered Oct 03 '22 05:10

R for the Win