I've got some code that generates stratified weighted means, and I'm certain it worked a few months ago, but I'm not sure what the current problem is. (I apologize; this must be very basic stuff.)
dp=
structure(list(seqn = c(1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 13L, 3L, 4L, 9L, 10L, 11L, 14L, 8L, 11L, 12L, 10L,
5L, 13L, 2L, 14L, 3L, 9L, 6L, 7L), sex = c(2L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), bmi = c(22.8935608711259,
27.0944623781918, 40.4637162938634, 23.7649712675423, 15.3193372705538,
31.1280302540991, 21.4866354393239, 20.3200254374398, 32.331092513536,
25.3679771839413, 33.9400508162971, 14.7048592172926, 25.5243757788688,
23.4331882363495, 27.6428134168995, 29.3923629426172, 24.9547209666314,
17.0522203606383, 15.51, 22, 30.62, 30.94, 29.1, 25.57, 24.9,
27.33, 17.63, 18.48, 22.56, 29.39), tc = c(273L, 181L, 150L,
201L, 142L, 165L, 235L, 219L, 298L, 222L, 143L, 134L, 268L, 160L,
236L, 225L, 260L, 140L, 162L, 132L, 156L, 140L, 279L, 314L, 215L,
174L, 129L, 148L, 153L, 245L), swt = c(1645, 3318, 2280, 1574,
4062, 1627, 14604, 24675, 975, 975, 2697, 1559, 1737.58, 1730.23,
19521.36, 28080.57, 1248.43, 13745.77, 5251.76464426326, 6497.194885522,
15915.7023420765, 3740.96809540218, 16574.177622509, 307.32513798849,
4720.89748295751, 3247.78896499604, 7698.70949077031, 1262.6450411464,
6609.43340735515, 4254.23723479882)), .Names = c("seqn", "sex",
"bmi", "tc", "swt"), row.names = c(20560L, 20561L, 20562L, 20563L,
20565L, 20566L, 20567L, 20568L, 20569L, 20570L, 20571L, 20572L,
61335L, 61336L, 61338L, 61339L, 61340L, 61341L, 95465L, 96890L,
104613L, 105988L, 107581L, 112267L, 113403L, 114292L, 119979L,
120271L, 125939L, 135699L), class = "data.frame")
library(data.table)
dt=data.table(dp, key='sex')
sapply(dp,function(x)weighted.mean(x,dp$swt)) #this works for the overall weighted mean
dt[,lapply(.SD, mean, na.rm=TRUE), .SDcols=c('bmi','tc','swt')] #this also works for the overall unweighted mean
dt[,lapply(.SD, function(x)weighted.mean(x,swt, na.rm=TRUE)), by=key(dt), .SDcols=c('bmi','tc','swt')]
but this gives the error:
Error in weighted.mean.default(x, swt, na.rm = TRUE) : object 'swt' not found
sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.6
loaded via a namespace (and not attached):
[1] tools_2.15.2
o DT[, lapply(.SD, function(), by=] did not see columns of DT when optimisation is "on". This is now fixed, #2381. Tests added and tested successfully. Thanks to David F for reporting on SO: data.table and stratified means
This is indeed a bug introduced somewhere between 1.8.2 and 1.8.6.
dt[,lapply(.SD, function(x) weighted.mean(x,swt, na.rm=TRUE)), by=key(dt),
.SDcols=c('bmi','tc','swt')]
Error in weighted.mean.default(x, swt, na.rm = TRUE) :
object 'swt' not found
To work around this in the meantime, either turn off optimization:
options(datatable.optimize=FALSE)
dt[,lapply(.SD, function(x)weighted.mean(x,swt, na.rm=TRUE)), by=key(dt),
.SDcols=c('bmi','tc','swt')]
sex bmi tc swt
1: 1 25.64376 206.0115 17171.20
2: 2 23.73566 193.8727 11467.47
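If switching optimization off for the whole session is undesirable, the option can be changed only around the affected query and then restored. The following is a minimal sketch on a small made-up table: `toy` and its values are hypothetical, and `0L` is assumed to behave like the `FALSE` used above.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
toy <- data.table(sex = c(1L, 1L, 2L, 2L),
                  bmi = c(20, 30, 22, 26),
                  swt = c(1, 3, 1, 1),
                  key = "sex")

# options() returns the previous values, so they can be restored afterwards
old <- options(datatable.optimize = 0L)  # assumption: 0L is equivalent to FALSE here

res_off <- toy[, lapply(.SD, function(x) weighted.mean(x, swt, na.rm = TRUE)),
               by = sex, .SDcols = "bmi"]

options(old)  # put back whatever optimization level was set before
```

The rest of the session keeps the benefit of optimization, and only the one query runs unoptimized.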
or don't wrap with function():
options(datatable.optimize=TRUE)
dt[,lapply(.SD, weighted.mean, swt, na.rm=TRUE), by=key(dt),
.SDcols=c('bmi','tc','swt')]
sex bmi tc swt
1: 1 25.64376 206.0115 17171.20
2: 2 23.73566 193.8727 11467.47
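As an independent check on results like these, the per-group weighted means can also be computed in base R without data.table. This is only a sketch on a small made-up data frame (`toy` and its values are hypothetical); the same pattern applied to `dp` should presumably agree with the output above.

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(sex = c(1L, 1L, 2L, 2L),
                  bmi = c(20, 30, 22, 26),
                  swt = c(1, 3, 1, 1))

# Split the rows by stratum, then take the weighted mean within each piece
by_sex <- split(toy, toy$sex)
wm_bmi <- sapply(by_sex, function(d) weighted.mean(d$bmi, d$swt, na.rm = TRUE))
```

`wm_bmi` is a named numeric vector with one element per value of `sex`.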
We are making more use of optimization now, but this case slipped through the test suite: tests 825.1, 825.2 and 825.3 didn't cover an argument to a function being another column, within an anonymous function(). That would be a real problem where the function isn't already given. Here, by contrast, the function() wrapper can simply be omitted, since weighted.mean already exists and can be applied as-is.
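Where no ready-made function exists, one possible way to stay with the second workaround is to define a named helper that takes the other column as an explicit argument, so that no anonymous function() appears inside j. This is a sketch under assumptions: `wmean` and `toy` are hypothetical names, and it has not been checked against 1.8.6 specifically.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
toy <- data.table(sex = c(1L, 1L, 2L, 2L),
                  bmi = c(20, 30, 22, 26),
                  swt = c(1, 3, 1, 1),
                  key = "sex")

# Hypothetical named helper: the weights arrive as an ordinary argument
wmean <- function(x, w) weighted.mean(x, w, na.rm = TRUE)

# Same shape as lapply(.SD, weighted.mean, swt, na.rm=TRUE): function name, then extra arguments
res_named <- toy[, lapply(.SD, wmean, swt), by = sex, .SDcols = "bmi"]
```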
You can see how optimization modifies j by setting verbose=TRUE (either per query or with the global option). In this case the verbose output would not have revealed anything wrong; I mention it only as an aside.
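A minimal sketch of both ways of turning on verbose output, again on a small made-up table (`toy` is hypothetical). The exact messages printed depend on the data.table version, so none are shown here.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
toy <- data.table(sex = c(1L, 1L, 2L, 2L),
                  bmi = c(20, 30, 22, 26),
                  swt = c(1, 3, 1, 1),
                  key = "sex")

# Per query: pass verbose=TRUE to this one call only
res_verbose <- toy[, lapply(.SD, weighted.mean, swt, na.rm = TRUE),
                   by = sex, .SDcols = "bmi", verbose = TRUE]

# Globally: every subsequent query reports what it is doing
old <- options(datatable.verbose = TRUE)
res_global <- toy[, lapply(.SD, weighted.mean, swt, na.rm = TRUE),
                  by = sex, .SDcols = "bmi"]
options(old)  # restore the previous setting
```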
Now filed as #2381: Optimization of lapply(.SD, function() ...) no longer sees columns inside .... Will fix and add tests so this can't regress again.
Thanks!