Below I define a master dataset of dimensions 12x5. I divide it into four data.tables and I want to merge them. There is no row ID overlap between data.tables and some column name overlap. When I merge them, merge()
doesn't recognize column name matches, and creates new columns for every column in each data.table. The final merged data.table should be 12x5, but it is coming out as 12x7. I thought that the all=TRUE
command in data.table's merge()
would solve this.
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))
setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")
final <- merge(a, b, all = TRUE)
final <- merge(final, c, all = TRUE)
final <- merge(final, d, all = TRUE)
names(final)
dim(final) #outputs correct numb of rows, but too many columns
One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2) . The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.
How do I concatenate two columns in R? To concatenate two columns you can use the <code>paste()</code> function. For example, if you want to combine the two columns A and B in the dataframe df you can use the following code: <code>df['AB'] <- paste(df$A, df$B)</code>.
To combine two data frames with same columns in R language, call rbind() function, and pass the two data frames, as arguments. rbind() function returns the resulting data frame created from concatenating the given two data frames. For rbind() function to combine the given data frames, the column names must match.
In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package.
The problem is with the way you are using the 'merge' function. 'merge' function in data.table package by default merges two data tables by the "shared key columns between them". Suppose you create 'a' and 'b' data tables like this:
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")
where 'a' is going to be like:
id C1
1: 1 1
2: 2 2
3: 3 3
and 'b' is going to be like:
id C1 C2
1: 4 1 2
2: 5 2 3
3: 6 3 4
now, lets first try your code:
merge(a, b, all = TRUE)
This is the result:
id C1.x C1.y C2
1: 1 1 NA NA
2: 2 2 NA NA
3: 3 3 NA NA
4: 4 NA 1 2
5: 5 NA 2 3
6: 6 NA 3 4
This is due to the fact that 'merge' function is taking only 'id' field (shared key between data tables 'a' and 'b') as the merging column, while adding all non-shared columns to the resulting data table. Now lets try specifying what columns to merge on:
merge(a, b, by=c("id","C1"), all = TRUE)
now the result is going to be:
id C1 C2
1: 1 1 NA
2: 2 2 NA
3: 3 3 NA
4: 4 1 2
5: 5 2 3
6: 6 3 4
Same applies to other merge functions you called. So try this:
final <- merge(a, b, by=c("id","C1"), all = TRUE)
final <- merge(final, c, by="id", all = TRUE) #here you don't necessarily need to specify by...
final <- merge( final, d, by=c("id","C3"),all=TRUE)
dim(final)
[1] 12 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With