data.table merge produces extra columns [R]

Tags: merge, r, data.table

Below I define a master dataset of dimensions 12x5. I divide it into four data.tables and want to merge them back together. There is no row ID overlap between the data.tables, but there is some column-name overlap. When I merge them, merge() doesn't recognize the matching column names and creates separate columns for each data.table's copy. The final merged data.table should be 12x5, but it comes out as 12x7. I thought the all=TRUE argument in data.table's merge() would solve this.

library(data.table)

a <- data.table(id = c(1, 2, 3),  C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6),  C1 = c(1, 2, 3),  C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9),  C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12),  C3 = c(8, 2, 3), C4 = c(4, 6, 8))

setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")

final <- merge(a, b,  all = TRUE)
final <- merge(final, c,  all = TRUE)
final <- merge(final, d,  all = TRUE)

names(final)
dim(final)  # outputs the correct number of rows, but too many columns

asked Aug 11 '14 13:08

Dr. Beeblebrox

1 Answer

The problem is in how you are calling merge(). By default, data.table's merge() joins two data.tables on "the shared key columns between them", not on every column name they have in common. Suppose you create the data.tables 'a' and 'b' like this:

library(data.table)
a <- data.table(id = c(1, 2, 3),  C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6),  C1 = c(1, 2, 3),  C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")

where 'a' looks like:

   id C1
1:  1  1
2:  2  2
3:  3  3

and 'b' looks like:

   id C1 C2
1:  4  1  2
2:  5  2  3
3:  6  3  4

Now, let's first try your code:

merge(a, b,  all = TRUE)

This is the result:

   id C1.x C1.y C2
1:  1    1   NA NA
2:  2    2   NA NA
3:  3    3   NA NA
4:  4   NA    1  2
5:  5   NA    2  3
6:  6   NA    3  4

This happens because merge() uses only 'id' (the key shared by 'a' and 'b') as the merging column. C1 exists in both tables but is not part of the key, so it is treated as an ordinary column and carried over from each side, with the suffixes .x and .y added to tell the two copies apart. Now let's specify which columns to merge on:

merge(a, b, by=c("id","C1"), all = TRUE)

Now the result is:

   id C1 C2
1:  1  1 NA
2:  2  2 NA
3:  3  3 NA
4:  4  1  2
5:  5  2  3
6:  6  3  4
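
One way to see why C1 was split in the first attempt is to compare the key columns the two tables share with the column names they share. This is only an illustrative check; the variable names shared_keys and shared_cols are made up for this sketch and are not part of data.table.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")

# Key columns present in both tables: what merge() falls back on
# when `by` is not given and both tables are keyed
shared_keys <- intersect(key(a), key(b))

# Column names present in both tables: C1 is here but not in the keys,
# so it ends up as C1.x / C1.y unless it is listed in `by`
shared_cols <- intersect(names(a), names(b))

print(shared_keys)
print(shared_cols)
```

Any column that shows up in shared_cols but not in shared_keys is a candidate for the .x/.y duplication seen above.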

The same applies to the other merge() calls you made. So try this:

final <- merge(a, b, by = c("id", "C1"), all = TRUE)
final <- merge(final, c, by = "id", all = TRUE)  # here you don't necessarily need to specify by...
final <- merge(final, d, by = c("id", "C3"), all = TRUE)

dim(final)
[1] 12  5
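
If there are many tables, writing out `by` for each call gets tedious. A possible shortcut, sketched here under the assumption that any column name shared by two tables really does hold the same variable, is to merge on every shared column name and fold the list with base R's Reduce(). The helper name merge_shared is hypothetical and not part of data.table.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))

# Hypothetical helper: full outer merge on every column name
# that the two tables have in common
merge_shared <- function(x, y) {
  merge(x, y, by = intersect(names(x), names(y)), all = TRUE)
}

# Fold the helper over the list: ((a + b) + c) + d
final <- Reduce(merge_shared, list(a, b, c, d))

dim(final)
names(final)
```

The column order may differ from the step-by-step version, because merge() puts the `by` columns first. Note also that this only behaves like the explicit calls when rows with the same id never disagree on a shared column; otherwise they would come out as separate rows.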

answered Oct 03 '22 05:10

R for the Win
