
How to do a random stratified sampling with Python (Not a train/test split)?

I am looking for the best way to draw a stratified random sample, as is done for surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit, since I am not doing supervised learning and have no target variable. I just want to create stratified random samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).

Python is my main language.

Thank you for any help

asl asked May 06 '18 00:05



2 Answers

This is my best solution so far. It is important to bin continuous variables beforehand and to make sure that every stratum contains a minimum number of observations.
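As a rough illustration of that preparation step, the sketch below bins a continuous column into quartiles with pd.qcut and then counts the rows in each stratum. The column name income_eur, the lognormal distribution and the min_obs threshold are hypothetical choices made for the example, not part of the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data: one continuous variable and one categorical variable
df = pd.DataFrame({
    "income_eur": rng.lognormal(mean=10, sigma=0.5, size=n),
    "sex": rng.integers(0, 2, n),
})

# Bin the continuous variable into 4 quantile-based classes (codes 0..3)
df["income_bin"] = pd.qcut(df["income_eur"], q=4, labels=False)

# Count the observations in every stratum
strata_cols = ["income_bin", "sex"]
stratum_sizes = df.groupby(strata_cols).size()

# Flag strata that fall below an (assumed) minimum size
min_obs = 100
too_small = stratum_sizes[stratum_sizes < min_obs]

print(stratum_sizes)
print("Strata below the threshold:", len(too_small))
```

If too_small is not empty, one option is to use fewer or wider bins until every stratum is large enough to sample from.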

In this example, I am:

  1. Generating a population
  2. Sampling in a pure random way
  3. Sampling in a random stratified way

When comparing both samples, the stratified one matches the income, sex and age proportions of the overall population much more closely.

If anyone has an idea for a better way to do it, please feel free to share.


import pandas as pd
import numpy as np

# Generate a random population (100k rows)

population = pd.DataFrame(index=range(100000))
population['income'] = 0
# Assign through .loc on the DataFrame itself; the chained form
# population['income'].iloc[...] = ... is not guaranteed to write back
population.loc[39000:79999, 'income'] = 1
population.loc[80000:, 'income'] = 2
population['sex'] = np.random.randint(0, 2, 100000)
population['age'] = np.random.randint(0, 4, 100000)

# One entry per (income, sex, age) stratum
pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()

# Simple random sampling (100 observations out of 100k, without replacement)

random_sample = population.sample(n=len(population) // 1000)

# Stratified random sampling (0.1% of every stratum, about 100 observations in total)

stratified_sample = [
    population[
        (population['income'] == income)
        & (population['sex'] == sex)
        & (population['age'] == age)
    ].sample(frac=0.001)
    for income, sex, age in pop_count.index
]

stratified_sample = pd.concat(stratified_sample)
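The per-stratum filtering can probably be shortened with DataFrameGroupBy.sample, which is available from pandas 1.1 onwards. The sketch below is one possible variant, not the answer's own code: it rebuilds a similar population, draws both kinds of sample, and measures how far each sample's stratum shares are from the population's. The helper names stratum_shares and max_share_gap, the seeds, and the sampling fraction of 1% (chosen to limit rounding effects) are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Population with the same income split as above (39% / 41% / 20%)
population = pd.DataFrame({
    "income": np.repeat([0, 1, 2], [39_000, 41_000, 20_000]),
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(0, 4, n),
})
strata_cols = ["income", "sex", "age"]

# Stratified sample: the same fraction is drawn inside every stratum
stratified_sample = population.groupby(strata_cols).sample(frac=0.01, random_state=0)

# Simple random sample of the same size, without replacement
random_sample = population.sample(n=len(stratified_sample), random_state=0)


def stratum_shares(df):
    """Share of rows falling into each stratum (hypothetical helper)."""
    return df.groupby(strata_cols).size() / len(df)


pop_shares = stratum_shares(population)


def max_share_gap(sample):
    """Largest absolute difference between sample and population shares."""
    shares = stratum_shares(sample).reindex(pop_shares.index, fill_value=0)
    return (shares - pop_shares).abs().max()


print("stratified:", max_share_gap(stratified_sample))
print("random:    ", max_share_gap(random_sample))
```

Because each stratum size is rounded after applying frac, the total sample size is only approximately 1% of the population.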
asl answered Oct 12 '22 17:10


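Given that the variables are binned, the following one-liner should give you the desired output. scikit-learn is mainly used for purposes other than yours, but borrowing one of its functions should do no harm: with the stratify argument, train_test_split keeps the proportions of the given columns in both parts of the split, so the small "train" part (0.1% of the rows here) serves as the stratified sample and the rest is discarded.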

Note that with a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.

If you test the following method, please share whether it behaves as expected or not.

from sklearn.model_selection import train_test_split

stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])
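One way to test this is sketched below, assuming a recent scikit-learn and a smaller, made-up population of 20,000 rows. It passes an integer train_size instead of a fractional test_size so that the sample size is exact, then checks the size, the absence of duplicate rows, and that every stratum is represented. The seeds and sizes are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical binned population: 3 x 2 x 4 = 24 strata
population = pd.DataFrame({
    "income": rng.integers(0, 3, n),
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(0, 4, n),
})
strata_cols = ["income", "sex", "age"]

# Keep 200 rows as the stratified sample and discard the remainder
stratified_sample, _ = train_test_split(
    population,
    train_size=200,
    stratify=population[strata_cols],
    random_state=0,
)

print("rows:", len(stratified_sample))
print("duplicate rows:", stratified_sample.index.duplicated().sum())
print("strata covered:", stratified_sample.groupby(strata_cols).ngroups)
```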
Furkan Gursoy answered Oct 12 '22 16:10