I am trying to simulate a new dataset from two smaller datasets. It is important for me to maintain the marginal counts from these smaller datasets in the final dataset. Hopefully this reproducible example explains what I mean.
library(data.table) # 1.10.5
set.seed(123)
meanVal <- 40
Here I simulate some age and gender data. Each location will always have 2 levels of gender and 100 levels of age.
demoDat <- CJ(with(CJ(letters,letters[1:5]), paste0(V1,V2)), c("M","F"), 0:99)
setnames(demoDat, c("Location","Gender","Age"))
demoDat[, Val := rpois(.N, meanVal)]
Location Gender Age Val
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
---
25998: ze M 97 45
25999: ze M 98 38
26000: ze M 99 39
This code simulates the temporal data dimension. In this case, the dates are spaced out by week, but the actual data do not have to adhere to this uniformity. Weeks can be missing.
timeDat <- with(demoDat, CJ(unique(Location), seq(from=as.Date("2016-01-01"),by=7,length.out = 52)))
setnames(timeDat, c("Location","Date"))
# Per-location totals from demoDat, to be spread across that location's dates
totals <- demoDat[, .(Val=sum(Val)), by=.(Location)]
# For each location (by=.EACHI), split its total across the dates with one multinomial draw
timeDat[totals, Val := rmultinom(1, i.Val, prob=rep(1,.N)), by=.EACHI, on=.(Location)]
Location Date Val
1: aa 2016-01-01 176
2: aa 2016-01-08 143
3: aa 2016-01-15 143
---
6758: ze 2016-12-09 165
6759: ze 2016-12-16 142
6760: ze 2016-12-23 156
Each location should have a Val column that totals to the same amount in both the demoDat and timeDat datasets.
timeDat[, sum(Val), by=.(Location)][order(-V1)][1:5]
# Location V1
# 1: jb 8229
# 2: xb 8223
# 3: ad 8179
# 4: nc 8176
# 5: gd 8173
demoDat[, sum(Val), by=.(Location)][order(-V1)][1:5]
# Location V1
# 1: jb 8229
# 2: xb 8223
# 3: ad 8179
# 4: nc 8176
# 5: gd 8173
Next, I want to create a dataset with Age, Gender, and Date variables. But I need to maintain my marginal sums of Val from the demoDat and timeDat datasets.
I have one strategy that accomplishes this task, but it takes up quite a bit of RAM. Is there another strategy I can employ that performs the expansion one group at a time? Maybe using .EACHI?
This is the expensive part of the operation. The datasets are expanded so that the number of rows equals sum(Val). In cases where sum(Val) is > 500,000,000, this gets expensive, especially since the operation is repeated for a second dataset. I am hoping to use .EACHI so that only the data within each group are expanded, which should lower the memory footprint substantially.
library(pryr)
memUsed <- mem_used()
# Expand each table to one row per count (nrow = sum(Val)), numbering rows within Location
demoDatBig <- demoDat[rep(1:.N, Val), .(Location, Gender, Age, ID=rowid(Location))]
timeDatBig <- timeDat[rep(1:.N, Val), .(Location, Date, ID=rowid(Location))]
# Join the two expansions on (Location, ID) to give every demographic row a Date
demoDatBig[timeDatBig, Date := i.Date, on=.(Location, ID)]
# Collapse back down to counts
finalBigDat <- demoDatBig[, .(Val=.N), by=.(Location, Gender, Age, Date)]
mem_used() - memUsed
# 47 MB
So this operation took 47 MB of RAM, but if I increase meanVal, the memory use increases significantly. I would like this to use only as much RAM as the same operation would take on the biggest Location and ID group. I think this is possible using .EACHI, but I am not exactly sure how.
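For what it's worth, the memory pattern I am imagining is sketched below: expand and re-collapse one Location at a time, so only a single location is ever held in expanded form. (expandOneLocation is just an illustrative helper built from my own code above, not the .EACHI construct I am asking about.)
# Rough sketch of the per-group idea: expand one Location, aggregate it immediately,
# and keep only the small aggregated result
expandOneLocation <- function(loc) {
  d  <- demoDat[Location == loc][rep(1:.N, Val),
                                 .(Location, Gender, Age, ID = rowid(Location))]
  tm <- timeDat[Location == loc][rep(1:.N, Val),
                                 .(Location, Date, ID = rowid(Location))]
  d[tm, Date := i.Date, on = .(Location, ID)]
  d[, .(Val = .N), by = .(Location, Gender, Age, Date)]
}
# finalBigDat would then be rbindlist(lapply(unique(demoDat$Location), expandOneLocation))
For reference, finalBigDat from my current (memory-hungry) approach looks like this: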
Location Gender Age Date Val
1: aa F 0 2016-01-01 36
2: aa F 1 2016-01-01 47
3: aa F 2 2016-01-01 29
4: aa F 3 2016-01-01 40
5: aa F 4 2016-01-01 24
---
32430: ze M 96 2016-12-16 7
32431: ze M 96 2016-12-23 34
32432: ze M 97 2016-12-23 45
32433: ze M 98 2016-12-23 38
32434: ze M 99 2016-12-23 39
#### Test 1
test1 <- finalBigDat[, .(Val = sum(Val)), by=.(Location, Gender, Age)]
test1[demoDat, ValCheck := i.Val, on=.(Location, Gender, Age)]
test1[Val != ValCheck]
#Empty data.table (0 rows) of 5 cols: Location,Gender,Age,Val,ValCheck
#### Test 2
test2 <- finalBigDat[, .(Val = sum(Val)), by=.(Location, Date)]
test2[timeDat, ValCheck := i.Val, on=.(Location, Date)]
test2[Val != ValCheck]
#Empty data.table (0 rows) of 4 cols: Location,Date,Val,ValCheck
I went through both solutions and tracked memory and system timings for each. Both solutions were amazing and are huge upgrades to what I currently have. @swihart's solution scales unbelievably well to large meanVal, so I chose it as the accepted answer. Heather's answer will help in situations where meanVal is not quite as large. Both large and small meanVal occur frequently, so I will need both.
meanVal Ans Time Mem (MB) Rows
1: 40 Mike.Gahan 0.6245470 secs 44.54293 32434
2: 40 Heather Turner 0.6391492 secs 38.65355 1352000
3: 40 swihart 11.1602619 secs 66.97550 1352000
4: 400 Mike.Gahan 2.593275 secs 437.23832 32611
5: 400 Heather Turner 1.303993 secs 38.79871 1352000
6: 400 swihart 11.736836 secs 66.97550 1352000
7: 4000 Mike.Gahan 30.390986 secs 4364.51501 32629
8: 4000 Heather Turner 6.279249 secs 38.79871 1352000
9: 4000 swihart 11.427965 secs 66.97550 1352000
10: 20000 Mike.Gahan -------did not finish----------
11: 20000 Heather Turner 23.78948 secs 36.30617 1352000
12: 20000 swihart 11.53811 secs 66.97550 1352000
13: 30000 Mike.Gahan -------did not finish----------
14: 30000 Heather Turner 537.6459 secs 57.15375 1352000
15: 30000 swihart 11.970013 secs 66.97474 1352000
I ran your approach for different sizes of meanVal and saw the scaling issue for the approach of generating demoDatBig and timeDatBig. I have an approach (enclosed at the bottom of this post) that generates cartDat, the Cartesian cross of the dates and the gender-age groups, which is robust to increases in meanVal and sum(Val), as seen in this table of object.size() results for the data.tables being discussed:
| meanVal | sum(Val) | demoDatBig (MB) | timeDatBig (MB) | cartDat (MB) |
|----------|----------|------------------|------------------|---------------|
| 40 | 1e6 | 27.8 | 15.9 | 67.1 |
| 400 | 1e7 | 277.6 | 158.7 | 67.1 |
| 4,000 | 1e8 | 2,776.8 | 1,586.8 | 67.1 |
| 40,000 | 1e9 | 27,770.3 | 15,868.7 | 67.1 |
The key to my approach is to generate a Cartesian cross between the unexpanded source data.tables demoDat and timeDat, and then use an "iterative multivariate hypergeometric sampling" (IMHS) scheme to preserve the margins of both source data.tables. To get R functionality for the IMHS, I took the BiasedUrn package from CRAN and recompiled it so that it can handle 52 colors (in our application, Dates). If the maximum number of Dates for a given location needs to be adjusted, let me know and I'll recompile it again. The resulting package, BiasedUrn52, is on GitHub.
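To give a feel for a single step of the scheme before the full data.table implementation at the bottom of this post, here is an illustrative draw with made-up numbers: one gender-age group's demoDat total is drawn across the 52 Dates, where the urn holds the Date totals not yet allocated for that Location, and whatever remains becomes the urn for the next group.
# Illustrative single IMHS draw (toy numbers only; the real implementation is below)
# devtools::install_github("swihart/BiasedUrn52")
library(BiasedUrn52)
set.seed(1)
remaining  <- rep(150L, 52)  # Date totals still unallocated for one Location (toy values)
groupTotal <- 36L            # demoDat Val for one gender-age group (toy value)
draw <- as.vector(rMFNCHypergeo(1,              # one random vector
                                remaining,      # balls of each color: one color per Date
                                groupTotal,     # number of balls drawn: the group's margin
                                rep(1/52, 52))) # equal odds for every Date
sum(draw) == groupTotal       # TRUE -- the gender-age margin is preserved exactly
remaining <- remaining - draw # ...and this becomes the urn for the next group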
My solution passes test1 and test2 and preserves the marginals. However, it seems to distribute the gender-age marginal across more dates than the OP's procedure does. Allow me to elaborate:
If we take the first 5 rows of demoDat:
> head(demoDat,5)
Location Gender Age Val
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
4: aa F 3 40
5: aa F 4 50
And the first 6 rows of finalBigDat:
> head(finalBigDat,6)
Location Gender Age Date Val
1: aa F 0 2016-01-01 36
2: aa F 1 2016-01-01 47
3: aa F 2 2016-01-01 29
4: aa F 3 2016-01-01 40
5: aa F 4 2016-01-01 24
6: aa F 4 2016-01-08 26
We see the entire 36 for the F-0 gender-age group was ascribed to 2016-01-01, whereas the 50 for the F-4 group was distributed across 2016-01-01 (24) and 2016-01-08 (26), but no other dates (50=24+26).
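(A quick check of that claim from the finalBigDat built in the question; given the head() shown above, the F-4 group should come back as exactly these two rows:)
finalBigDat[Location == "aa" & Gender == "F" & Age == 4]
#    Location Gender Age       Date Val
# 1:       aa      F   4 2016-01-01  24
# 2:       aa      F   4 2016-01-08  26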
The IMHS method distributes the marginals across many more dates (I'm not sure whether this is desired or not -- please let me know). For instance, IMHS took the 36 of the F-0 group and, instead of putting all 36 onto 2016-01-01 as in finalBigDat, spread them across more Dates (look at seq.Draws):
> cartDat[Location=='aa' & Gender=="F" & Age==0,
+ c("Location", "Gender", "Age", "Date", "seq.Draws"),
+ with=FALSE]
Location Gender Age Date seq.Draws
1: aa F 0 2016-01-01 1
2: aa F 0 2016-01-08 0
3: aa F 0 2016-01-15 1
4: aa F 0 2016-01-22 1
5: aa F 0 2016-01-29 0
6: aa F 0 2016-02-05 0
7: aa F 0 2016-02-12 0
8: aa F 0 2016-02-19 0
9: aa F 0 2016-02-26 0
10: aa F 0 2016-03-04 0
11: aa F 0 2016-03-11 0
12: aa F 0 2016-03-18 0
13: aa F 0 2016-03-25 3
14: aa F 0 2016-04-01 1
15: aa F 0 2016-04-08 0
16: aa F 0 2016-04-15 0
17: aa F 0 2016-04-22 1
18: aa F 0 2016-04-29 1
19: aa F 0 2016-05-06 0
20: aa F 0 2016-05-13 2
21: aa F 0 2016-05-20 0
22: aa F 0 2016-05-27 0
23: aa F 0 2016-06-03 0
24: aa F 0 2016-06-10 0
25: aa F 0 2016-06-17 1
26: aa F 0 2016-06-24 2
27: aa F 0 2016-07-01 0
28: aa F 0 2016-07-08 0
29: aa F 0 2016-07-15 0
30: aa F 0 2016-07-22 1
31: aa F 0 2016-07-29 0
32: aa F 0 2016-08-05 1
33: aa F 0 2016-08-12 1
34: aa F 0 2016-08-19 1
35: aa F 0 2016-08-26 1
36: aa F 0 2016-09-02 1
37: aa F 0 2016-09-09 2
38: aa F 0 2016-09-16 0
39: aa F 0 2016-09-23 1
40: aa F 0 2016-09-30 0
41: aa F 0 2016-10-07 2
42: aa F 0 2016-10-14 3
43: aa F 0 2016-10-21 0
44: aa F 0 2016-10-28 1
45: aa F 0 2016-11-04 1
46: aa F 0 2016-11-11 1
47: aa F 0 2016-11-18 0
48: aa F 0 2016-11-25 0
49: aa F 0 2016-12-02 2
50: aa F 0 2016-12-09 1
51: aa F 0 2016-12-16 1
52: aa F 0 2016-12-23 1
This note on the distributional differences between the OP's approach and the IMHS cartDat approach is just an aside. The marginals are preserved, as seen below.
The marginals for timeDat are preserved:
> cartDat[, sum(seq.Draws), by=.(Location, Date)]
Location Date V1
1: aa 2016-01-01 176
2: aa 2016-01-08 143
3: aa 2016-01-15 143
4: aa 2016-01-22 154
5: aa 2016-01-29 174
---
6756: ze 2016-11-25 169
6757: ze 2016-12-02 148
6758: ze 2016-12-09 165
6759: ze 2016-12-16 142
6760: ze 2016-12-23 156
> timeDat
Location Date Val
1: aa 2016-01-01 176
2: aa 2016-01-08 143
3: aa 2016-01-15 143
4: aa 2016-01-22 154
5: aa 2016-01-29 174
---
6756: ze 2016-11-25 169
6757: ze 2016-12-02 148
6758: ze 2016-12-09 165
6759: ze 2016-12-16 142
6760: ze 2016-12-23 156
as are the marginals for demoDat:
> cartDat[, sum(seq.Draws), by=.(Location, Gender, Age)]
Location Gender Age V1
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
4: aa F 3 40
5: aa F 4 50
---
25996: ze M 95 48
25997: ze M 96 41
25998: ze M 97 45
25999: ze M 98 38
26000: ze M 99 39
> demoDat
Location Gender Age Val
1: aa F 0 36
2: aa F 1 47
3: aa F 2 29
4: aa F 3 40
5: aa F 4 50
---
25996: ze M 95 48
25997: ze M 96 41
25998: ze M 97 45
25999: ze M 98 38
26000: ze M 99 39
Here is the IMHS cartDat approach and some tests:
#Cartesian cross of demoDat and timeDat
devtools::install_github("swihart/BiasedUrn52")
library(BiasedUrn52)
setkey(timeDat, Location)
setkey(demoDat, Location, Gender, Age)
cartDat <- demoDat[timeDat, allow.cartesian=TRUE]
setkeyv(cartDat, key(demoDat))
cartDat
cartDat[,group:=.GRP,by=c("Gender", "Age") ]
cartDat[,demoDat.Val:=Val]
cartDat[,timeDat.Val:=i.Val]
setcolorder(cartDat, c("Location",
"group",
"Gender",
"Age",
"Val",
"demoDat.Val",
"Date",
"timeDat.Val",
"i.Val"))
#Define Iterative Multivariate Hypergeometric Sampling function
imhs <- function(.N, Val, i.Val, group){
  grp.ind  <- unique(group)             # gender-age group ids present in this Location
  num.grp  <- max(group)
  grp.size <- as.numeric(table(group))  # number of Dates within each group (52 here)
  draws    <- rep(NA, length(group))
  for(grp in grp.ind){
    # Draw this group's demoDat total (Val) across its Dates, where the urn holds
    # the Date totals not yet allocated (i.Val), with equal odds for every Date
    draws[group==grp] = rMFNCHypergeo(1,
                                      i.Val[group==grp],
                                      Val[group==grp][1],
                                      rep(1/grp.size[grp.ind==grp], grp.size[grp.ind==grp])
                                      )
    # Whatever remains becomes the urn for the next group
    if(grp < num.grp){
      i.Val[group==(grp+1)] = i.Val[group==grp] - draws[group==grp]
    }
  }
  list(i.Val, draws)
}
# run it the data.table way:
cartDat[,
c("seq.Val", "seq.Draws") := imhs(.N, demoDat.Val, timeDat.Val, group),
by=c("Location") ]
# take a look:
cartDat
# reconciliation
demoDat[, sum(Val), by=.(Location)][order(-V1)]
cartDat[, sum(seq.Draws), by=.(Location)][order(-V1)]
# do the checks for the margins:
cartDat[, sum(seq.Draws), by=.(Location, Date)]
timeDat
cartDat[, sum(seq.Draws), by=.(Location, Gender, Age)]
demoDat
# such different sizes due to distributing across more dates:
nrow(demoDat)
nrow(cartDat)
nrow(cartDat[seq.Draws != 0])
nrow(finalBigDat)
nrow(cartDat[seq.Draws != 0])/nrow(finalBigDat)
# attain and print object sizes for cartDat
print(object.size(cartDat), units = "Mb")
print(object.size(cartDat[seq.Draws!=0]), units="Mb")
# attain and print object sizes for demoDatBig, timeDatBig, finalBigDat
print(object.size(demoDatBig), units = "Mb")
print(object.size(timeDatBig), units = "Mb")
print(object.size(finalBigDat), units = "Mb")
## (OP) The solution would pass these tests:
finalBigDat2 <- cartDat
#### Test 1 (change to sum(seq.Draws))
test1 <- finalBigDat2[, .(Val = sum(seq.Draws)), by=.(Location, Gender, Age)]
test1[demoDat, ValCheck := i.Val, on=.(Location, Gender, Age)]
test1[Val != ValCheck]
#Empty data.table (0 rows) of 5 cols: Location,Gender,Age,Val,ValCheck
#### Test 2 (change to sum(seq.Draws))
test2 <- finalBigDat2[, .(Val = sum(seq.Draws)), by=.(Location, Date)]
test2[timeDat, ValCheck := i.Val, on=.(Location, Date)]
test2[Val != ValCheck]
#Empty data.table (0 rows) of 4 cols: Location,Date,Val,ValCheck