Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R merge a particular column from one data frame to another according to its reference column

Tags:

merge

dataframe

r

I am a beginner R user. I have two huge dataframes and I want to add new column called Vaccine at hkdata.2, which the data is taken from another DF adherence according to 2 reference columns at hkdata.2 (hhID and member), could someone please help me?

hkdata.2
hhID    member  T0  delta   X_hh    X_fm    ILI age
1          1    7      0    0       0        0  44
1          2    7      0    0       0        0  36
2          1    8      0    1       0        0  39
2          2    8      0    1       0        0  39

adherence
hhID member mask soap vaccine
1      0      1    0    1   
1      1      1    1    1
1      2      0    0    1
2      0      1    0    0
2      1      0    0    0
2      2      1    0    1

So at the end I can get something like this. With an extra column called vaccine in hkdata.2

hkdata.2
hhID    member  T0  delta   X_hh    X_fm    ILI age vaccine
1          1    7      0    0       0        0  44    1
1          2    7      0    0       0        0  36    1
2          1    8      0    1       0        0  39    0
2          2    8      0    1       0        0  39    1
like image 813
londwwq1 Avatar asked Nov 16 '25 08:11

londwwq1


2 Answers

Update: Using v1.9.6 for on= syntax. See history for old code.

require(data.table) # v1.9.6+
setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")]
#    hhID member T0 delta X_hh X_fm ILI age vaccine
# 1:    1      1  7     0    0    0   0  44       1
# 2:    1      2  7     0    0    0   0  36       1
# 3:    2      1  8     0    1    0   0  39       0
# 4:    2      2  8     0    1    0   0  39       1
  1. setDT converts a data.frame to data.table by reference.

  2. Perform the join on columns specified by on=. Note that this join is both a) fast *and* b) memory efficient. a) *fast* because they're binary search based joins, and no copy is being made here at all. Thevaccinecolumn is directly added to yourhkdata.2data.table. b) *memory efficient* because only the columnvaccine` that you require is used in the join, and not the other columns (which is particularly great to have on really large datasets).


Here's a benchmark assuming 100,000 hhIDs and 200 members within each hhID:

require(data.table) # v1.9.6
require(dplyr) # 0.4.3.9000

set.seed(98192L)
N = 40e6 # 40 million rows
hkdata.2 = data.frame(hhID   = rep(1:1e5, each=200), 
                      member = 1:200,
                      T0     = sample(10),
                      delta  = sample(0:1), 
                      X_hh   = sample(0:1), 
                      X_fm   = sample(0:1), 
                      ILI    = sample(0:1),
                      age    = sample(30:100, N/2, TRUE))

# let's go with 100,000 hhIDs and 400 members here:
adherence = data.frame(hhID    = rep(1:1e5, each=400), 
                       member  = 1:400,
                       mask    = sample(0:1),
                       soap    = sample(0:1),
                       vaccine = sample(0:1))

## dplyr timing
system.time(ans1 <- left_join(hkdata.2, select(adherence, -soap, -mask)))
#   user  system elapsed
# 16.977   2.163  19.605
## data.table timing
system.time(setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")])
#   user  system elapsed
#  1.186   0.233   1.427

The peak memory usage of dplyr was 4.7GB and took 19.6 seconds to complete, whereas data.table took 1.4 seconds with peak memory usage of 2.2GB.

Summary: data.table is ~14x faster and ~2.1x memory efficient here.

like image 122
Arun Avatar answered Nov 17 '25 20:11

Arun


 library(dplyr)
 left_join(hkdata.2, adherence)
 #    Joining by: c("hhID", "member")
 #  hhID member T0 delta X_hh X_fm ILI age mask soap vaccine
 #1    1      1  7     0    0    0   0  44    1    1       1
 #2    1      2  7     0    0    0   0  36    0    0       1
 #3    2      1  8     0    1    0   0  39    0    0       0
 #4    2      2  8     0    1    0   0  39    1    0       1

If you don't need mask, soap

  left_join(hkdata.2, adherence) %>% select(-soap, -mask)

Or

  left_join(hkdata.2, adherence[,c("hhID", "member", "vaccine")])
like image 44
akrun Avatar answered Nov 17 '25 20:11

akrun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!