 

Simulating Data Efficiently with data.table

I am trying to simulate a new dataset from two smaller datasets. It is important for me to maintain the marginal counts from these smaller datasets in the final dataset. Hopefully this reproducible example explains what I mean.

Build fake data

library(data.table) # 1.10.5
set.seed(123)
meanVal <- 40

demoDat

Here I simulate some age and gender data. Each location will always have 2 levels of gender and 100 levels of age.

# 130 locations (26 letters x 5 suffixes), each crossed with 2 genders and 100 ages
demoDat <- CJ(with(CJ(letters, letters[1:5]), paste0(V1, V2)), c("M","F"), 0:99)
setnames(demoDat, c("Location", "Gender", "Age"))
demoDat[, Val := rpois(.N, meanVal)]  # simulated count for each demographic cell


       Location Gender Age Val
    1:       aa      F   0  36
    2:       aa      F   1  47
    3:       aa      F   2  29
   ---                        
25998:       ze      M  97  45
25999:       ze      M  98  38
26000:       ze      M  99  39
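
As a quick sanity check, there should be 130 locations (26 x 5) and 26,000 rows; uniqueN() is data.table's unique-count helper:

demoDat[, .(locations = uniqueN(Location), rows = .N)]
#    locations  rows
# 1:       130 26000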

timeDat

This code simulates the temporal dimension of the data. Here the dates are spaced a week apart, but the actual data do not have to follow this uniform spacing; weeks can be missing.

timeDat <- with(demoDat, CJ(unique(Location), seq(from=as.Date("2016-01-01"), by=7, length.out=52)))
setnames(timeDat, c("Location", "Date"))
totals <- demoDat[, .(Val = sum(Val)), by=.(Location)]
# one multinomial draw per location splits its total across the 52 dates
timeDat[totals, Val := rmultinom(1, i.Val, prob=rep(1, .N)), by=.EACHI, on=.(Location)]

      Location       Date Val
   1:       aa 2016-01-01 176
   2:       aa 2016-01-08 143
   3:       aa 2016-01-15 143
  ---                        
6758:       ze 2016-12-09 165
6759:       ze 2016-12-16 142
6760:       ze 2016-12-23 156

Quick reconciliation

Each location's Val column should total to the same value in both the demoDat and the timeDat datasets.

timeDat[, sum(Val), by=.(Location)][order(-V1)][1:5]
#    Location   V1
# 1:       jb 8229
# 2:       xb 8223
# 3:       ad 8179
# 4:       nc 8176
# 5:       gd 8173
demoDat[, sum(Val), by=.(Location)][order(-V1)][1:5]
#    Location   V1
# 1:       jb 8229
# 2:       xb 8223
# 3:       ad 8179
# 4:       nc 8176
# 5:       gd 8173

Desired final dataset

Next, I want to create a dataset with Age, Gender, and Date variables. But I need to maintain my marginal sums of Val from the demoDat and timeDat datasets.

I have one strategy that accomplishes this task, but it takes up quite a bit of RAM. Is there another strategy I can employ that performs the expansion one group at a time? Maybe using .EACHI?

Expand both datasets and merge

This is the expensive part of the operation. The datasets are expanded so that the number of rows equals sum(Val). When sum(Val) is greater than 500,000,000 this gets expensive, especially since the operation is repeated for a second dataset. I am hoping to use .EACHI so that only the data within each group are expanded, which should lower the memory footprint substantially.

library(pryr)
memUsed <- mem_used()
# expand both tables so one row represents one unit of Val, then align rows
# within each Location by their running row number
demoDatBig <- demoDat[rep(1:.N, Val), .(Location, Gender, Age, ID=rowid(Location))]
timeDatBig <- timeDat[rep(1:.N, Val), .(Location, Date, ID=rowid(Location))]
demoDatBig[timeDatBig, Date := i.Date, on=.(Location, ID)]
finalBigDat <- demoDatBig[, .(Val=.N), by=.(Location, Gender, Age, Date)]
mem_used() - memUsed
# 47 MB

So this operation took 47 MB of RAM, but the usage grows rapidly as meanVal increases. Ideally it would use only as much RAM as the same operation takes on the single biggest Location group. I think this is possible using .EACHI, but I am not exactly sure how.
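
Something like this per-group sketch is the shape I have in mind (untested; it pairs the expanded rows positionally within each Location, exactly as my ID join does, so only one Location's expansion is ever materialized at a time):

finalByGroup <- demoDat[, {
  n  <- .N                                        # demographic cells in this Location
  td <- timeDat[Location == .BY$Location]         # this Location's weekly margin
  dd <- data.table(.SD[rep(seq_len(n), Val), .(Gender, Age)],  # one row per unit of Val
                   Date = rep(td$Date, td$Val))   # pair rows positionally
  dd[, .(Val = .N), by = .(Gender, Age, Date)]    # collapse back to counts
}, by = Location]

Peak memory should then track the largest Location group rather than the whole expanded table.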

Resulting data.table

       Location Gender Age       Date Val
    1:       aa      F   0 2016-01-01  36
    2:       aa      F   1 2016-01-01  47
    3:       aa      F   2 2016-01-01  29
    4:       aa      F   3 2016-01-01  40
    5:       aa      F   4 2016-01-01  24
   ---                                   
32430:       ze      M  96 2016-12-16   7
32431:       ze      M  96 2016-12-23  34
32432:       ze      M  97 2016-12-23  45
32433:       ze      M  98 2016-12-23  38
32434:       ze      M  99 2016-12-23  39

The solution should pass these tests:

#### Test 1
test1 <- finalBigDat[, .(Val = sum(Val)), by=.(Location, Gender, Age)]
test1[demoDat, ValCheck := i.Val, on=.(Location, Gender, Age)]
test1[Val != ValCheck]
#Empty data.table (0 rows) of 5 cols: Location,Gender,Age,Val,ValCheck

#### Test 2
test2 <- finalBigDat[, .(Val = sum(Val)), by=.(Location, Date)]
test2[timeDat, ValCheck := i.Val, on=.(Location, Date)]
test2[Val != ValCheck]
#Empty data.table (0 rows) of 4 cols: Location,Date,Val,ValCheck

Results

I went through both solutions and tracked memory and system timings for each. Both solutions are huge upgrades over what I currently have. @swihart's solution scales unbelievably well to large meanVal, so I chose it as the accepted answer. Heather Turner's answer will help in situations where meanVal is not quite as large. Both large and small meanVal occur frequently, so I will need both.

| meanVal | Answer         | Time (secs)    | Mem (MB)   | Rows    |
|--------:|----------------|---------------:|-----------:|--------:|
|      40 | Mike.Gahan     |      0.6245470 |   44.54293 |   32434 |
|      40 | Heather Turner |      0.6391492 |   38.65355 | 1352000 |
|      40 | swihart        |     11.1602619 |   66.97550 | 1352000 |
|     400 | Mike.Gahan     |       2.593275 |  437.23832 |   32611 |
|     400 | Heather Turner |       1.303993 |   38.79871 | 1352000 |
|     400 | swihart        |      11.736836 |   66.97550 | 1352000 |
|    4000 | Mike.Gahan     |      30.390986 | 4364.51501 |   32629 |
|    4000 | Heather Turner |       6.279249 |   38.79871 | 1352000 |
|    4000 | swihart        |      11.427965 |   66.97550 | 1352000 |
|   20000 | Mike.Gahan     | did not finish |            |         |
|   20000 | Heather Turner |       23.78948 |   36.30617 | 1352000 |
|   20000 | swihart        |       11.53811 |   66.97550 | 1352000 |
|   30000 | Mike.Gahan     | did not finish |            |         |
|   30000 | Heather Turner |       537.6459 |   57.15375 | 1352000 |
|   30000 | swihart        |      11.970013 |   66.97474 | 1352000 |
asked Feb 09 '17 by Mike.Gahan


1 Answer

I ran your approach for different sizes of meanVal and saw the scaling issue with generating demoDatBig and timeDatBig. My approach (enclosed at the bottom of this post) instead generates cartDat -- the Cartesian cross of the dates and gender-age groups -- which is robust to increases in meanVal and sum(Val), as seen in this table listing object.size() results for the data.tables being discussed:

| meanVal  | sum(Val) | demoDatBig (MB)  | timeDatBig (MB)  | cartDat (MB)  |
|----------|----------|------------------|------------------|---------------|
|      40  |     1e6  |            27.8  |            15.9  |          67.1 |
|     400  |     1e7  |           277.6  |           158.7  |          67.1 |
|   4,000  |     1e8  |         2,776.8  |         1,586.8  |          67.1 |
|  40,000  |     1e9  |        27,770.3  |        15,868.7  |          67.1 |

The key to my approach is to generate a Cartesian cross between the unexpanded source data.tables demoDat and timeDat, and then use an "iterative multivariate hypergeometric sampling" (IMHS) scheme to preserve the margins of both source data.tables. To get IMHS functionality in R, I took the BiasedUrn package from CRAN and recompiled it so that it can handle 52 colors (in our application, Dates). If the maximum number of Dates for a given location needs to be adjusted, let me know and I will recompile again. The resulting package, BiasedUrn52, is on github.
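
As a quick illustration of the building block (toy numbers I made up, equal odds): a single multivariate hypergeometric draw returns one count per color, each bounded by that color's urn count and summing to the number drawn -- exactly the margin-preserving property IMHS exploits, with Dates playing the role of colors.

library(BiasedUrn52)  # devtools::install_github("swihart/BiasedUrn52")
# draw n = 10 balls from an urn holding m = c(5, 7, 9) balls of three colors
rMFNCHypergeo(nran = 1, m = c(5, 7, 9), n = 10, odds = rep(1, 3))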

My solution passes test1 and test2 and preserves the marginals. However, it seems to distribute the gender-age marginal across more dates than the OP's procedure does. Allow me to elaborate:

If we take the first 5 rows of demoDat:

> head(demoDat,5)
   Location Gender Age Val
1:       aa      F   0  36
2:       aa      F   1  47
3:       aa      F   2  29
4:       aa      F   3  40
5:       aa      F   4  50

And the first 6 of finalBigDat:

> head(finalBigDat,6)
   Location Gender Age       Date Val
1:       aa      F   0 2016-01-01  36
2:       aa      F   1 2016-01-01  47
3:       aa      F   2 2016-01-01  29
4:       aa      F   3 2016-01-01  40
5:       aa      F   4 2016-01-01  24
6:       aa      F   4 2016-01-08  26

We see the entire 36 for the F-0 gender-age group was ascribed to 2016-01-01, whereas the 50 for the F-4 group was distributed across 2016-01-01 (24) and 2016-01-08 (26), but no other dates (50=24+26).

The IMHS method distributes the marginals across many more dates (I'm not sure whether this is desired or not -- please let me know). For instance, IMHS took the 36 of the F-0 group and, instead of putting all 36 onto 2016-01-01 as in finalBigDat, spread them across more Dates (look at seq.Draws):

> cartDat[Location=='aa' & Gender=="F" & Age==0,
+         c("Location", "Gender", "Age", "Date", "seq.Draws"),
+         with=FALSE]
    Location Gender Age       Date seq.Draws
 1:       aa      F   0 2016-01-01         1
 2:       aa      F   0 2016-01-08         0
 3:       aa      F   0 2016-01-15         1
 4:       aa      F   0 2016-01-22         1
 5:       aa      F   0 2016-01-29         0
 6:       aa      F   0 2016-02-05         0
 7:       aa      F   0 2016-02-12         0
 8:       aa      F   0 2016-02-19         0
 9:       aa      F   0 2016-02-26         0
10:       aa      F   0 2016-03-04         0
11:       aa      F   0 2016-03-11         0
12:       aa      F   0 2016-03-18         0
13:       aa      F   0 2016-03-25         3
14:       aa      F   0 2016-04-01         1
15:       aa      F   0 2016-04-08         0
16:       aa      F   0 2016-04-15         0
17:       aa      F   0 2016-04-22         1
18:       aa      F   0 2016-04-29         1
19:       aa      F   0 2016-05-06         0
20:       aa      F   0 2016-05-13         2
21:       aa      F   0 2016-05-20         0
22:       aa      F   0 2016-05-27         0
23:       aa      F   0 2016-06-03         0
24:       aa      F   0 2016-06-10         0
25:       aa      F   0 2016-06-17         1
26:       aa      F   0 2016-06-24         2
27:       aa      F   0 2016-07-01         0
28:       aa      F   0 2016-07-08         0
29:       aa      F   0 2016-07-15         0
30:       aa      F   0 2016-07-22         1
31:       aa      F   0 2016-07-29         0
32:       aa      F   0 2016-08-05         1
33:       aa      F   0 2016-08-12         1
34:       aa      F   0 2016-08-19         1
35:       aa      F   0 2016-08-26         1
36:       aa      F   0 2016-09-02         1
37:       aa      F   0 2016-09-09         2
38:       aa      F   0 2016-09-16         0
39:       aa      F   0 2016-09-23         1
40:       aa      F   0 2016-09-30         0
41:       aa      F   0 2016-10-07         2
42:       aa      F   0 2016-10-14         3
43:       aa      F   0 2016-10-21         0
44:       aa      F   0 2016-10-28         1
45:       aa      F   0 2016-11-04         1
46:       aa      F   0 2016-11-11         1
47:       aa      F   0 2016-11-18         0
48:       aa      F   0 2016-11-25         0
49:       aa      F   0 2016-12-02         2
50:       aa      F   0 2016-12-09         1
51:       aa      F   0 2016-12-16         1
52:       aa      F   0 2016-12-23         1

Noting the distributional differences between the OP's approach and the IMHS cartDat approach is just an aside. The marginals are preserved, as seen below.

The marginals for timeDat are preserved:

> cartDat[, sum(seq.Draws), by=.(Location, Date)]
      Location       Date  V1
   1:       aa 2016-01-01 176
   2:       aa 2016-01-08 143
   3:       aa 2016-01-15 143
   4:       aa 2016-01-22 154
   5:       aa 2016-01-29 174
  ---                        
6756:       ze 2016-11-25 169
6757:       ze 2016-12-02 148
6758:       ze 2016-12-09 165
6759:       ze 2016-12-16 142
6760:       ze 2016-12-23 156
> timeDat
      Location       Date Val
   1:       aa 2016-01-01 176
   2:       aa 2016-01-08 143
   3:       aa 2016-01-15 143
   4:       aa 2016-01-22 154
   5:       aa 2016-01-29 174
  ---                        
6756:       ze 2016-11-25 169
6757:       ze 2016-12-02 148
6758:       ze 2016-12-09 165
6759:       ze 2016-12-16 142
6760:       ze 2016-12-23 156

as are the marginals for demoDat:

> cartDat[, sum(seq.Draws), by=.(Location, Gender, Age)]
       Location Gender Age V1
    1:       aa      F   0 36
    2:       aa      F   1 47
    3:       aa      F   2 29
    4:       aa      F   3 40
    5:       aa      F   4 50
   ---                       
25996:       ze      M  95 48
25997:       ze      M  96 41
25998:       ze      M  97 45
25999:       ze      M  98 38
26000:       ze      M  99 39
> demoDat
       Location Gender Age Val
    1:       aa      F   0  36
    2:       aa      F   1  47
    3:       aa      F   2  29
    4:       aa      F   3  40
    5:       aa      F   4  50
   ---                        
25996:       ze      M  95  48
25997:       ze      M  96  41
25998:       ze      M  97  45
25999:       ze      M  98  38
26000:       ze      M  99  39

Here is the IMHS cartDat approach and some tests:

#Cartesian cross of demoDat and timeDat
devtools::install_github("swihart/BiasedUrn52")
library(BiasedUrn52)
setkey(timeDat, Location)
setkey(demoDat, Location, Gender, Age)
cartDat <- demoDat[timeDat, allow.cartesian=TRUE]  # every gender-age cell x every date
setkeyv(cartDat, key(demoDat))
cartDat
cartDat[, group := .GRP, by=c("Gender", "Age")]  # sequential id per gender-age cell
cartDat[, demoDat.Val := Val]    # gender-age margin
cartDat[, timeDat.Val := i.Val]  # date margin
setcolorder(cartDat, c("Location",
                       "group",
                       "Gender",
                       "Age",
                       "Val",
                       "demoDat.Val",
                       "Date",
                       "timeDat.Val",
                       "i.Val"))

#Define Iterative Multivariate Hypergeometric Sampling function
imhs <- function(.N, Val, i.Val, group){

  grp.ind <- unique(group)
  num.grp <- max(group)
  grp.size <- as.numeric(table(group))

  draws <- rep(NA, length(group))
  for(grp in grp.ind){

    if(grp==1){
      draws[group==1] = rMFNCHypergeo(1, 
                                      i.Val[group==1], 
                                      Val[group==1][1], 
                                      rep(1/grp.size[grp.ind==1],grp.size[grp.ind==1])
      )
      i.Val[group==2]= i.Val[group==1]-draws[group==1]
    }else{
      draws[group==grp] = rMFNCHypergeo(1, 
                                        i.Val[group==grp], 
                                        Val[group==grp][1], 
                                        rep(1/grp.size[grp.ind==grp],grp.size[grp.ind==grp])
      )
      if(grp<=num.grp){
        i.Val[group==(grp+1)]= i.Val[group==grp]-draws[group==grp]
      }
    }

  }

  list(i.Val, draws)
}
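
To see the mechanics on something tiny (an illustrative sketch with made-up numbers; run after the definitions above): one location, two gender-age groups, three dates.

# two gender-age groups of 3 dates each; date margin c(5,4,6) sums to 15,
# and the group margins 7 and 8 also sum to 15, as required
grp  <- c(1,1,1, 2,2,2)
dVal <- c(7,7,7, 8,8,8)      # gender-age margin, repeated within each group
tVal <- c(5,4,6, NA,NA,NA)   # date margin; group 2's urn is filled in by imhs
out  <- imhs(6L, dVal, tVal, grp)
sum(out[[2]][grp == 1])                  # 7 -- group 1's margin preserved
sum(out[[2]][grp == 2])                  # 8 -- group 2 takes whatever remains
out[[2]][grp == 1] + out[[2]][grp == 2]  # c(5, 4, 6) -- date margins preserved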


# run it the data.table way:
cartDat[,
        c("seq.Val", "seq.Draws") := imhs(.N, demoDat.Val, timeDat.Val, group),        
        by=c("Location") ]

# take a look:
cartDat

# reconciliation
demoDat[, sum(Val), by=.(Location)][order(-V1)]
cartDat[, sum(seq.Draws), by=.(Location)][order(-V1)]

# do the checks for the margins:
cartDat[, sum(seq.Draws), by=.(Location, Date)]
timeDat
cartDat[, sum(seq.Draws), by=.(Location, Gender, Age)]
demoDat


# such different sizes due to distributing across more dates:
nrow(demoDat)
nrow(cartDat)
nrow(cartDat[seq.Draws != 0])
nrow(finalBigDat)
nrow(cartDat[seq.Draws != 0])/nrow(finalBigDat)

# obtain and print object sizes for cartDat
print(object.size(cartDat), units = "Mb")
print(object.size(cartDat[seq.Draws!=0]), units="Mb")

# obtain and print object sizes for demoDatBig, timeDatBig, finalBigDat
print(object.size(demoDatBig), units = "Mb")
print(object.size(timeDatBig), units = "Mb")
print(object.size(finalBigDat), units = "Mb")



## (OP) The solution would pass these tests:
finalBigDat2 <- cartDat

#### Test 1 (change to sum(seq.Draws))
test1 <- finalBigDat2[, .(Val = sum(seq.Draws)), by=.(Location, Gender, Age)]
test1[demoDat, ValCheck := i.Val, on=.(Location, Gender, Age)]
test1[Val != ValCheck]
#Empty data.table (0 rows) of 5 cols: Location,Gender,Age,Val,ValCheck

#### Test 2 (change to sum(seq.Draws))
test2 <- finalBigDat2[, .(Val = sum(seq.Draws)), by=.(Location, Date)]
test2[timeDat, ValCheck := i.Val, on=.(Location, Date)]
test2[Val != ValCheck]
#Empty data.table (0 rows) of 4 cols: Location,Date,Val,ValCheck
answered Oct 31 '22 by swihart