
How to merge two large datasets while generating a new column with differently repeated values in R

I have a problem that is driving me crazy, and I really need your help. A simplified version of it is this:

library(data.table)

d1 <- data.table(v1 = c("a","b","c","d","d","b","a","c","a","d","b","a"),
                 v2 = 1:12, V3 = rep(1:4, times = 3))

d2<-data.table(v1=c("a","b","c","d"),v3=c(3,2,1,4),v4=c("y","x","t","e"))

This will yield two data sets:

> d1
    v1 v2 V3
 1:  a  1  1
 2:  b  2  2
 3:  c  3  3
 4:  d  4  4
 5:  d  5  1
 6:  b  6  2
 7:  a  7  3
 8:  c  8  4
 9:  a  9  1
10:  d 10  2
11:  b 11  3
12:  a 12  4

> d2
   v1 v3 v4
1:  a  3  y
2:  b  2  x
3:  c  1  t
4:  d  4  e

As you can see, v1 and v3 (named V3 in d1) take the same set of values in both data sets. Now I want to join the two data sets by creating a new column in d1 that returns the value of v4 from d2 wherever both v1 and v3 match. I hope to get output that looks like this:

> d3
    v1 v2 V3 V4
 1:  a  1  1 na
 2:  b  2  2  x
 3:  c  3  3 na
 4:  d  4  4  e
 5:  d  5  1 na
 6:  b  6  2  x
 7:  a  7  3  y
 8:  c  8  4 na
 9:  a  9  1 na
10:  d 10  2 na
11:  b 11  3 na
12:  a 12  4 na

The actual data I am using is much larger: it is something like joining a 113MB data set with a 23MB one. I tried a for loop, but because the data is so long it takes ages to finish. I also tried merge and sqldf, but both of them failed to finish the job. Could you please help me with this problem? Thank you very much!

sxgn asked Nov 03 '14 17:11


1 Answer

I'd do it like this:

setkey(d1, v1, V3) 
d1[d2, v4 := v4][]

  • For a keyed join of the form x[i], the key of x needs to be set; i may or may not have a key. So we set the key of d1 to columns v1 and V3 here. Note that setkey() also sorts d1 by those columns, by reference.

  • Next, we perform a join d1[d2] which, for each row of d2, finds the rows that match on the key columns of d1 and returns the join result. That result is not exactly what we are after. We'd rather add a new column where each matching row gets its value from d2's v4, and NA otherwise. For this we make use of data.table's sub-assign by reference functionality. While joining i to x, we can still provide an expression in j and refer to i's columns. You can also refer to them as i.v4 (usually used if there are columns with the same name in both x and i).

  • := adds/updates a column by reference. The LHS of := is the name of the column we want to create, and the RHS, v4, is the value assigned to it (here, the column from d2). For each matching row, we therefore assign d2's v4 to d1's new column (which we also name v4) by reference (in place, meaning no copy is made); rows with no match get the default value of NA.

  • The last [] is just to print the output to screen, as := returns the result invisibly.
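As a rough sketch of the points above, the snippet below contrasts the plain join with the update join on the question's toy data. It makes two assumptions that are not part of the answer: it uses the `on=` argument (available in data.table versions released after this answer was written) instead of `setkey()`, so that d1 is not re-sorted, and it writes d2's v3 as integers so that its type matches V3 in d1. The name `joined` is only illustrative.

```r
library(data.table)

# Toy data from the question; v3 is written as integer here (an assumption)
# so that its type matches V3 in d1.
d1 <- data.table(v1 = c("a","b","c","d","d","b","a","c","a","d","b","a"),
                 v2 = 1:12, V3 = rep(1:4, times = 3))
d2 <- data.table(v1 = c("a","b","c","d"), v3 = c(3L, 2L, 1L, 4L),
                 v4 = c("y","x","t","e"))

# Plain join: returns a new table. For each row of d2 it holds the matching
# rows of d1; a d2 row with no match (c, 1) still appears, with NA in v2.
joined <- d1[d2, on = c(v1 = "v1", V3 = "v3")]

# Update join: adds v4 to d1 by reference, leaving NA where nothing matches.
# i.v4 explicitly refers to the v4 column of d2 (the `i` table).
d1[d2, on = c(v1 = "v1", V3 = "v3"), v4 := i.v4]
print(d1)
```

Because no key is set in this variant, d1 keeps its original row order, which is the order shown in the desired d3 in the question. With the keyed version, d1 ends up sorted by v1 and V3 instead.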

Hope this helps to understand what's going on here.

Arun answered Sep 20 '22 06:09