Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging tables that are sorted differently with R

I have 2 big tables. One with identifiers (unsorted), another with a list of identifiers (containing all which are in the first table) plus the associated values for one variable. I want to add a column to my first table containing the associated values that are in the second table. Is there an smart way to proceed using implemented functions of R ?

i.e.

table 1
id
8979786
62782
6268768
6776566

table 2
id        var
1          5
2          2
3          NA
…
9999999    6

and the result should be

table1
id       var
8979786   5
62782     NA
6268768   7
4776566   4

Thanks in advance

like image 414
freya Avatar asked Feb 03 '26 14:02

freya


2 Answers

So the id column is in both tables? You can merge them together: merge(table1, table2, sort = FALSE). There are lots of options to explore for merge that let you emulate different types of joins, similar to inner, left, right, and outer joins in SQL. I added the additional parameter sort here to preserve the original order of table1.

If there are ids in table1 but not table 2 and you want to show NAs for those, add all.x = TRUE as a parameter. This is equivalent to a left join. all.y is a right join, and all = TRUE is equivalent to a full outer join.

Reproducible example:

> set.seed(1)
> table1 <- data.frame( id = sample(1:5, 5, FALSE))
> table1
  id
1  2
2  5
3  4
4  3
5  1
> table2 <- data.frame( id = 1:5, var = rnorm(5))
> table2
  id        var
1  1  1.2724293
2  2  0.4146414
3  3 -1.5399500
4  4 -0.9285670
5  5 -0.2947204
> merge(table1, table2, sort = FALSE)
  id        var
1  2  0.4146414
2  5 -0.2947204
3  4 -0.9285670
4  3 -1.5399500
5  1  1.2724293
like image 63
Chase Avatar answered Feb 06 '26 03:02

Chase


Here is a data.table way of doing this, in case the data is big and speed is an issue. For more information, refer to the help page of ?data.table:

When i is a data.table, x (that is the outer data.table) must have a key. i (that is the inner data.table) is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge
of the two tables is carried out.

Note that I adjusted the sample data provided by Chase a little to make certain points about the matching in data.table more obvious:

require(data.table)
#Version 1.7.7
set.seed(1)
table1 <- data.table(id = sample(3:7, 5, FALSE), var1 = rnorm(5), key="id")
table2 <- data.table(id = 5:10, var2 = rnorm(6), key="id")

#Default: If id in table 1 is not in table 2, return NA
table2[table1]
#      id         var2       var1
# [1,]  3           NA -0.2947204
# [2,]  4           NA  1.2724293
# [3,]  5 -0.005767173 -0.9285670
# [4,]  6  2.404653389 -1.5399500
# [5,]  7  0.763593461  0.4146414

#If one wants to get rid of the NAs
table2[table1, nomatch=0]
#      id         var2       var1
# [1,]  5 -0.005767173 -0.9285670
# [2,]  6  2.404653389 -1.5399500
# [3,]  7  0.763593461  0.4146414

#Or the other way around: get all ids of table 2
table1[table2]
#      id       var1         var2
# [1,]  5 -0.9285670 -0.005767173
# [2,]  6 -1.5399500  2.404653389
# [3,]  7  0.4146414  0.763593461
# [4,]  8         NA -0.799009249
# [5,]  9         NA -1.147657009
# [6,] 10         NA -0.289461574

The obligatory speed test:

set.seed(10)
df1 <- data.frame(id = sample(1:5e6, 5e6, FALSE))
df2 <- data.frame(id = sample(1:5e6, 5e6, FALSE), var = rnorm(5e6))
system.time(df_solution <- merge(df1, df2, sort = TRUE))
#    user  system elapsed 
#   33.10    0.32   33.54
merge_dt <- function(df1, df2) {
  dt1 <- setkey(as.data.table(df1), "id")
  dt2 <- setkey(as.data.table(df2), "id")
  return(dt1[dt2])
}
system.time(dt_solution <- merge_dt(df1, df2))
#    user  system elapsed 
#   12.94    0.01   12.95 
all.equal(df_solution, as.data.frame(dt_solution))
#[1] TRUE

And my usual disclaimer: I'm still learning a lot about this package as well, so you find better information at the package homepage.

like image 35
Christoph_J Avatar answered Feb 06 '26 03:02

Christoph_J