How to merge two large datasets while generating a new column with different repeated values in R

I have a question that is driving me crazy, and I really need your help. A simplified version is this:

library(data.table)

d1 <- data.table(v1 = c("a","b","c","d","d","b","a","c","a","d","b","a"),
                 v2 = 1:12, V3 = rep(1:4, times = 3))

d2 <- data.table(v1 = c("a","b","c","d"), v3 = c(3,2,1,4), v4 = c("y","x","t","e"))

This will yield two data sets:

> d1
    v1 v2 V3
 1:  a  1  1
 2:  b  2  2
 3:  c  3  3
 4:  d  4  4
 5:  d  5  1
 6:  b  6  2
 7:  a  7  3
 8:  c  8  4
 9:  a  9  1
10:  d 10  2
11:  b 11  3
12:  a 12  4

> d2
   v1 v3 v4
1:  a  3  y
2:  b  2  x
3:  c  1  t
4:  d  4  e

As you can see, the values in v1 and v3 of d2 also occur in v1 and V3 of d1. I want to join the two data sets by adding a new column to d1 that holds the value of v4 from d2 wherever both v1 and V3 (v3 in d2) match. I hope to get output like this:

> d3
    v1 v2 V3 V4
 1:  a  1  1 NA
 2:  b  2  2  x
 3:  c  3  3 NA
 4:  d  4  4  e
 5:  d  5  1 NA
 6:  b  6  2  x
 7:  a  7  3  y
 8:  c  8  4 NA
 9:  a  9  1 NA
10:  d 10  2 NA
11:  b 11  3 NA
12:  a 12  4 NA

The actual data I am using is quite large: I am joining a data set of about 113 MB with one of about 23 MB. I tried a for loop, but because the data is so long it takes ages to finish. I also tried merge and sqldf, but both failed to finish the job. Could you please help me with this problem? Thank you very much!

asked Nov 03 '14 17:11

sxgn

1 Answer

I'd do it like this:

setkey(d1, v1, V3) 
d1[d2, v4 := v4][]

For a join of the form x[i], the key for x needs to be set; i may or may not have a key set. So here we set the key of d1 to the columns v1 and V3.
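A minimal sketch of what setting the key does, using the d1 from the question. One thing worth keeping in mind is that setkey() also sorts d1 by the key columns by reference, so the rows will no longer be in the order shown in the question:

```r
library(data.table)

# Same d1 as in the question
d1 <- data.table(v1 = c("a","b","c","d","d","b","a","c","a","d","b","a"),
                 v2 = 1:12, V3 = rep(1:4, times = 3))

key(d1)             # NULL: no key has been set yet

setkey(d1, v1, V3)  # sets the key and sorts d1 by v1, then V3, in place

key(d1)             # "v1" "V3"
d1[]                # rows are now ordered by v1 and V3, not by v2
```

If the original row order matters, v2 (or any row-number column) can be used to restore it afterwards, for example with setorder(d1, v2).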
Next, we perform the join d1[d2], which, for each row of d2, finds the rows of d1 that match on the key columns and returns the join result. (As d2 has no key, its first two columns, v1 and v3, are matched against d1's key columns.) That result is not exactly what we are after: we would rather add a new column where each matching row gets its value from d2's v4, and NA otherwise. For this we use data.table's sub-assign by reference functionality. While joining i to x, we can still provide an expression in j and refer to i's columns. You can also refer to them as i.v4 (usually used when x and i have columns with the same name).
:= adds or updates a column by reference. The LHS of := is the name of the column we want to create, and the RHS, v4, is the value we want to assign to it (here, the column from d2). For each matching row we therefore assign d2's v4 to d1's new column (which we also name v4) by reference (in place, meaning no copy is made), and rows with no match get the default value of NA.
The final [] just prints the output to the screen, as := returns the result invisibly.
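As a variation on the explanation above, here is a hedged sketch of the same update join written without a key. It assumes a data.table version that supports the on argument, and it writes d2's v3 as integer (an assumption of this sketch) so that its type matches d1's V3. Because no key is set, d1 keeps its original row order:

```r
library(data.table)

d1 <- data.table(v1 = c("a","b","c","d","d","b","a","c","a","d","b","a"),
                 v2 = 1:12, V3 = rep(1:4, times = 3))

# v3 is integer here (sketch assumption) to match the type of d1's V3
d2 <- data.table(v1 = c("a","b","c","d"),
                 v3 = c(3L, 2L, 1L, 4L),
                 v4 = c("y","x","t","e"))

# Match d1$v1 to d2$v1 and d1$V3 to d2$v3, then assign d2's v4
# (referred to as i.v4) to a new column v4 in d1, by reference.
d1[d2, on = .(v1, V3 = v3), v4 := i.v4]

# Rows 2 and 6 get "x", row 4 gets "e", row 7 gets "y"; all others are NA
d1[]
```

The i. prefix is not strictly needed here, since d1 has no v4 column before the join, but it makes explicit that the value comes from d2.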

Hope this helps to understand what's going on here.

answered Sep 20 '22 06:09

Arun

Tags: merge, r, data.table
