I am a beginner R user. I have two huge dataframes and I want to add new column called Vaccine at hkdata.2, which the data is taken from another DF adherence according to 2 reference columns at hkdata.2 (hhID and member), could someone please help me?
hkdata.2
hhID member T0 delta X_hh X_fm ILI age
1 1 7 0 0 0 0 44
1 2 7 0 0 0 0 36
2 1 8 0 1 0 0 39
2 2 8 0 1 0 0 39
adherence
hhID member mask soap vaccine
1 0 1 0 1
1 1 1 1 1
1 2 0 0 1
2 0 1 0 0
2 1 0 0 0
2 2 1 0 1
So at the end I can get something like this. With an extra column called vaccine in hkdata.2
hkdata.2
hhID member T0 delta X_hh X_fm ILI age vaccine
1 1 7 0 0 0 0 44 1
1 2 7 0 0 0 0 36 1
2 1 8 0 1 0 0 39 0
2 2 8 0 1 0 0 39 1
Update: Using v1.9.6 for on= syntax. See history for old code.
require(data.table) # v1.9.6+
setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")]
# hhID member T0 delta X_hh X_fm ILI age vaccine
# 1: 1 1 7 0 0 0 0 44 1
# 2: 1 2 7 0 0 0 0 36 1
# 3: 2 1 8 0 1 0 0 39 0
# 4: 2 2 8 0 1 0 0 39 1
setDT converts a data.frame to data.table by reference.
Perform the join on columns specified by on=. Note that this join is both a) fast *and* b) memory efficient. a) *fast* because they're binary search based joins, and no copy is being made here at all. Thevaccinecolumn is directly added to yourhkdata.2data.table. b) *memory efficient* because only the columnvaccine` that you require is used in the join, and not the other columns (which is particularly great to have on really large datasets).
Here's a benchmark assuming 100,000 hhIDs and 200 members within each hhID:
require(data.table) # v1.9.6
require(dplyr) # 0.4.3.9000
set.seed(98192L)
N = 40e6 # 40 million rows
hkdata.2 = data.frame(hhID = rep(1:1e5, each=200),
member = 1:200,
T0 = sample(10),
delta = sample(0:1),
X_hh = sample(0:1),
X_fm = sample(0:1),
ILI = sample(0:1),
age = sample(30:100, N/2, TRUE))
# let's go with 100,000 hhIDs and 400 members here:
adherence = data.frame(hhID = rep(1:1e5, each=400),
member = 1:400,
mask = sample(0:1),
soap = sample(0:1),
vaccine = sample(0:1))
## dplyr timing
system.time(ans1 <- left_join(hkdata.2, select(adherence, -soap, -mask)))
# user system elapsed
# 16.977 2.163 19.605
## data.table timing
system.time(setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")])
# user system elapsed
# 1.186 0.233 1.427
The peak memory usage of dplyr was 4.7GB and took 19.6 seconds to complete, whereas data.table took 1.4 seconds with peak memory usage of 2.2GB.
Summary:
data.tableis ~14x faster and ~2.1x memory efficient here.
library(dplyr)
left_join(hkdata.2, adherence)
# Joining by: c("hhID", "member")
# hhID member T0 delta X_hh X_fm ILI age mask soap vaccine
#1 1 1 7 0 0 0 0 44 1 1 1
#2 1 2 7 0 0 0 0 36 0 0 1
#3 2 1 8 0 1 0 0 39 0 0 0
#4 2 2 8 0 1 0 0 39 1 0 1
If you don't need mask, soap
left_join(hkdata.2, adherence) %>% select(-soap, -mask)
Or
left_join(hkdata.2, adherence[,c("hhID", "member", "vaccine")])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With