
How to create a stratified sample by state in R

How can I create a stratified sample in R using the "sampling" package? My dataset has 355,000 observations. The code below works up to the last line, where I always get the following error: "Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?"

Please do not point me to older posts on Stack Overflow. I have read them but could not get them to work for my case. Thank you.

## lpdata file has 355,000 observations
# Exclude Puerto Rico, Virgin Islands and Guam
sub.lpdata<-subset(lpdata,"STATE" != 'PR' | "STATE" != 'VI' | "STATE" != 'GU')

## Create a 10% sample, stratified by STATE
sort.lpdata<-sub.lpdata[order(sub.lpdata$STATE),]
tab.state<-data.frame(table(sort.lpdata$STATE))
size.strata<-as.vector(round(ceiling(tab.state$Freq)*0.1))

s<-strata(sort.lpdata,stratanames=sort.lpdata$STATE,size=size.strata,method="srswor")
vatodorov asked Mar 14 '12 14:03





1 Answer

I had to do something similar last year. If this is something you do a lot, you might want to use a function like the one below. It lets you specify the data frame you're sampling from, which variable defines the strata, and the sample size, either as a proportion of each group or as a fixed number of rows per group. Call set.seed() beforehand if you need a reproducible sample. You can save the function as something like "stratified.R" and load it when you need to. See http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

stratified <- function(df, group, size) {
  #  USE: * Specify your data frame and grouping variable (as column 
  #         number) as the first two arguments.
  #       * Decide on your sample size. For a sample proportional to the
  #         population, enter "size" as a decimal. For an equal number 
  #         of samples from each group, enter "size" as a whole number.
  #
  #  Example 1: Sample 10% of each group from a data frame named "z",
  #             where the grouping variable is the fourth variable, use:
  # 
  #                 > stratified(z, 4, .1)
  #
  #  Example 2: Sample 5 observations from each group from a data frame
  #             named "z"; grouping variable is the third variable:
  #
  #                 > stratified(z, 3, 5)
  #
  library(sampling)  # provides strata() and getdata()
  # Sort by the grouping variable so the rows line up with the
  # per-stratum sizes computed below. df[[group]] extracts a plain
  # vector, which order() and table() handle reliably.
  temp <- df[order(df[[group]]), ]
  counts <- table(temp[[group]])
  if (size < 1) {
    # Proportional sample: round each stratum's share up
    size <- as.vector(ceiling(counts * size))
  } else {
    # Fixed number of observations from every stratum
    size <- rep(size, times = length(counts))
  }
  # stratanames takes the column name, not the column's values
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  # Return the sampled rows together with the strata information
  getdata(temp, strat)
}
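
Applied directly to the code in the question, a minimal sketch might look like the following. The toy `lpdata` frame and its values are made up for illustration; only the `STATE` column name comes from the question. Two things differ from the question's code, and the second is the likely source of the error: the filter uses the bare column name with `%in%` (the quoted `"STATE" != 'PR'` compares two string literals, so it is always TRUE and excludes nothing), and `stratanames` is given the column name `"STATE"` rather than the column's values. The `strata()` call is guarded with `requireNamespace()` so that the rest of the sketch still runs where the "sampling" package is not installed.

```r
# Toy stand-in for the question's `lpdata` (hypothetical values).
set.seed(42)
lpdata <- data.frame(
  ID    = 1:100,
  STATE = rep(c("CA", "NY", "PR", "TX", "VI"), times = c(40, 30, 10, 15, 5)),
  stringsAsFactors = FALSE
)

# Exclude the territories: bare column name, and %in% instead of
# chaining != with |, which would keep every row.
sub.lpdata <- subset(lpdata, !(STATE %in% c("PR", "VI", "GU")))

# Sort by stratum so the size vector lines up with the order of the strata.
sort.lpdata <- sub.lpdata[order(sub.lpdata$STATE), ]

# 10% per state, rounded up so that small states keep at least one row.
size.strata <- as.vector(ceiling(table(sort.lpdata$STATE) * 0.1))

if (requireNamespace("sampling", quietly = TRUE)) {
  # stratanames takes the column *name*, not the column's values.
  s <- sampling::strata(sort.lpdata, stratanames = "STATE",
                        size = size.strata, method = "srswor")
  sampled <- sampling::getdata(sort.lpdata, s)
  print(head(sampled))
}
```

If `STATE` is stored as a factor, the excluded territories remain as empty levels after `subset()`; wrapping the result in `droplevels()` before calling `table()` should keep zero-count strata out of the size vector.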
A5C1D2H2I1M1N2O1R2T1 answered Oct 13 '22 10:10
