
data.table and stratified means

Tags:

r

data.table

I've got some code that generates stratified weighted means, and I'm certain it worked a few months ago, but I'm not sure what the current problem is. (I apologize - this must be very basic stuff):

dp=
structure(list(seqn = c(1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 3L, 4L, 9L, 10L, 11L, 14L, 8L, 11L, 12L, 10L, 
5L, 13L, 2L, 14L, 3L, 9L, 6L, 7L), sex = c(2L, 1L, 2L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), bmi = c(22.8935608711259, 
27.0944623781918, 40.4637162938634, 23.7649712675423, 15.3193372705538, 
31.1280302540991, 21.4866354393239, 20.3200254374398, 32.331092513536, 
25.3679771839413, 33.9400508162971, 14.7048592172926, 25.5243757788688, 
23.4331882363495, 27.6428134168995, 29.3923629426172, 24.9547209666314, 
17.0522203606383, 15.51, 22, 30.62, 30.94, 29.1, 25.57, 24.9, 
27.33, 17.63, 18.48, 22.56, 29.39), tc = c(273L, 181L, 150L, 
201L, 142L, 165L, 235L, 219L, 298L, 222L, 143L, 134L, 268L, 160L, 
236L, 225L, 260L, 140L, 162L, 132L, 156L, 140L, 279L, 314L, 215L, 
174L, 129L, 148L, 153L, 245L), swt = c(1645, 3318, 2280, 1574, 
4062, 1627, 14604, 24675, 975, 975, 2697, 1559, 1737.58, 1730.23, 
19521.36, 28080.57, 1248.43, 13745.77, 5251.76464426326, 6497.194885522, 
15915.7023420765, 3740.96809540218, 16574.177622509, 307.32513798849, 
4720.89748295751, 3247.78896499604, 7698.70949077031, 1262.6450411464, 
6609.43340735515, 4254.23723479882)), .Names = c("seqn", "sex", 
"bmi", "tc", "swt"), row.names = c(20560L, 20561L, 20562L, 20563L, 
20565L, 20566L, 20567L, 20568L, 20569L, 20570L, 20571L, 20572L, 
61335L, 61336L, 61338L, 61339L, 61340L, 61341L, 95465L, 96890L, 
104613L, 105988L, 107581L, 112267L, 113403L, 114292L, 119979L, 
120271L, 125939L, 135699L), class = "data.frame")

library(data.table)
dt=data.table(dp, key='sex')

sapply(dp,function(x)weighted.mean(x,dp$swt))  #this works for the overall weighted mean
dt[,lapply(.SD, mean, na.rm=T), .SDcols=c('bmi','tc','swt')]  #this also works for the overall unweighted mean

dt[,lapply(.SD, function(x)weighted.mean(x,swt, na.rm=TRUE)), by=key(dt), .SDcols=c('bmi','tc','swt')] 

but this gives the error: Error in weighted.mean.default(x, swt, na.rm = TRUE) : object 'swt' not found

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6

loaded via a namespace (and not attached):
[1] tools_2.15.2
David F asked Nov 18 '12 16:11



1 Answer

UPDATE (from Arun): This is now fixed in v1.8.11. From NEWS:

o DT[, lapply(.SD, function(), by=] did not see columns of DT when optimisation is "on". This is now fixed, #2381. Tests added and tested successfully. Thanks to David F for reporting on SO: data.table and stratified means


This is indeed a bug introduced somewhere between 1.8.2 and 1.8.6.

dt[,lapply(.SD, function(x) weighted.mean(x,swt, na.rm=TRUE)), by=key(dt),
    .SDcols=c('bmi','tc','swt')] 
Error in weighted.mean.default(x, swt, na.rm = TRUE) : 
    object 'swt' not found

To work around this in the meantime, either turn off optimization :

options(datatable.optimize=FALSE)
dt[,lapply(.SD, function(x)weighted.mean(x,swt, na.rm=TRUE)), by=key(dt),    
    .SDcols=c('bmi','tc','swt')]
   sex      bmi       tc      swt
1:   1 25.64376 206.0115 17171.20
2:   2 23.73566 193.8727 11467.47

or, don't wrap with function() :

options(datatable.optimize=TRUE)
dt[,lapply(.SD, weighted.mean, swt, na.rm=TRUE), by=key(dt),    
    .SDcols=c('bmi','tc','swt')] 
   sex      bmi       tc      swt
1:   1 25.64376 206.0115 17171.20
2:   2 23.73566 193.8727 11467.47
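
A third route, not part of the original answer, is to skip lapply(.SD, ...) and name each weighted mean in j directly. The sketch below uses a made-up table (`toy` and its values are hypothetical, not the question's data). It is an assumption that this form sidesteps the affected optimization, since the bug concerns lapply(.SD, function() ...). A base-R cross-check with split() is included for comparison.

```r
library(data.table)

# Hypothetical toy data: two strata (sex), one measure (bmi), survey weights (swt).
toy <- data.table(
  sex = c(1L, 1L, 2L, 2L),
  bmi = c(20, 30, 22, 26),
  swt = c(1, 3, 1, 1),
  key = "sex"
)

# Name the weighted mean explicitly in j instead of using lapply(.SD, ...).
res <- toy[, list(bmi = weighted.mean(bmi, swt, na.rm = TRUE)), by = sex]

# Base-R cross-check: split the rows by stratum and apply weighted.mean to each piece.
chk <- sapply(
  split(as.data.frame(toy), toy$sex),
  function(d) weighted.mean(d$bmi, d$swt, na.rm = TRUE)
)

print(res)
print(chk)
```

The drawback is that every column has to be written out by hand, so this suits a few columns better than a long .SDcols list.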

We are making more use of optimization now, but this case slipped through the test suite: tests 825.1, 825.2 and 825.3 didn't cover an anonymous function() whose argument is another column. The second workaround only applies when a suitable function already exists, as weighted.mean does here, so that function() can simply be omitted. Where the computation has to be written as an anonymous function, the bug would be a real problem.
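
For a computation that has no ready-made function, one possible way to keep the "no function()" form is to define a named helper that takes the weights as an explicit argument and pass the weight column through lapply's extra arguments. This is only a sketch: `wsq` and the `toy` data are hypothetical, and it has not been checked against the affected 1.8.6 release.

```r
library(data.table)

# Hypothetical toy data, not the question's dp.
toy <- data.table(
  sex = c(1L, 1L, 2L, 2L),
  bmi = c(20, 30, 22, 26),
  swt = c(1, 3, 1, 1),
  key = "sex"
)

# Hypothetical helper: weighted mean of squares, with the weights passed in
# as an argument rather than looked up from inside an anonymous function.
wsq <- function(x, w) weighted.mean(x^2, w, na.rm = TRUE)

# swt is supplied as an extra argument to lapply, as in the second workaround.
res <- toy[, lapply(.SD, wsq, swt), by = sex, .SDcols = c("bmi", "swt")]
print(res)
```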

You can see how optimization modifies j by setting verbose=TRUE (either per query or with the global option). In this case nothing would have been revealed as wrong by that verbose output, but just mentioning it as an aside.
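
Both ways of switching on verbose output can be sketched as follows, again on hypothetical toy data. The exact messages printed differ between data.table versions, so none are reproduced here.

```r
library(data.table)

# Hypothetical toy data, not the question's dp.
toy <- data.table(
  sex = c(1L, 1L, 2L, 2L),
  bmi = c(20, 30, 22, 26),
  swt = c(1, 3, 1, 1),
  key = "sex"
)

# Per query: verbose = TRUE prints diagnostics for this call only.
res_query <- toy[, lapply(.SD, weighted.mean, swt, na.rm = TRUE),
                 by = sex, .SDcols = c("bmi", "swt"), verbose = TRUE]

# Globally: set the option, run the query, then restore the previous setting.
old <- options(datatable.verbose = TRUE)
res_global <- toy[, lapply(.SD, weighted.mean, swt, na.rm = TRUE),
                  by = sex, .SDcols = c("bmi", "swt")]
options(old)
```

Restoring the saved options afterwards keeps later queries from printing diagnostics.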

Now filed as #2381: Optimization of lapply(.SD, function() ...) no longer sees columns inside .... Will fix and add tests so this can't regress again.

Thanks!

Matt Dowle answered Oct 13 '22 00:10
