I am looking for the best way to do random stratified sampling, like in surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit, since I am not doing supervised learning and have no target; I just want to create random stratified samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).
Python is my main language.
Thank you for any help.
We can achieve this with scikit-learn by setting the "stratify" argument of train_test_split() to the variable(s) whose proportions we want to preserve (in supervised learning this is usually the y array). train_test_split() then ensures that both resulting splits contain each class in the same proportion as in the array passed to "stratify".
In general, creating a stratified random sample involves seven steps: (a) defining the population; (b) choosing the relevant stratification variables; (c) listing the population; (d) listing the population according to the chosen stratification; (e) choosing your sample size; (f) calculating a proportionate stratification; and (g) drawing a random sample within each stratum.
The same idea appears in stratified train-test splits: it is desirable to split a dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
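As a minimal sketch of that stratify mechanism (the toy X and y arrays below are made up for the example):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with an imbalanced binary label (10% ones)
X = np.random.rand(1000, 3)
y = np.array([0] * 900 + [1] * 100)

# Passing the label array to `stratify` preserves the 90/10 class ratio
# in both resulting splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())  # both are approximately 0.10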
This is my best solution so far. It is important to bin continuous variables beforehand and to have a minimum number of observations in each stratum (both points are sketched after the code below).
In this example, I am:
- generating a random population of 100,000 observations with three stratification variables (income, sex and age);
- drawing a simple random sample of 100 observations;
- drawing a random stratified sample of 100 observations, proportionate to each income/sex/age stratum.
When comparing both samples, the stratified one is much more representative of the overall population.
If anyone has an idea of a more optimal way to do it, please feel free to share.
import pandas as pd
import numpy as np

# Generate a random population (100k observations)
population = pd.DataFrame(index=range(0, 100000))
population['income'] = 0
population.loc[39000:79999, 'income'] = 1  # .loc avoids chained-assignment issues
population.loc[80000:, 'income'] = 2
population['sex'] = np.random.randint(0, 2, 100000)
population['age'] = np.random.randint(0, 4, 100000)

# Number of observations in each income/sex/age stratum
pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()

# Simple random sampling (100 observations out of 100k,
# drawn with replacement via randint)
random_sample = population.iloc[
    np.random.randint(0, len(population), int(len(population) / 1000))
]

# Random stratified sampling (100 observations out of 100k):
# sample 0.1% of each stratum, then concatenate the pieces
stratified_sample = list(
    map(
        lambda x: population[
            (population['income'] == pop_count.index[x][0])
            & (population['sex'] == pop_count.index[x][1])
            & (population['age'] == pop_count.index[x][2])
        ].sample(frac=0.001),
        range(len(pop_count)),
    )
)
stratified_sample = pd.concat(stratified_sample)
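For what it's worth, here is a sketch of a more compact way to draw the same proportionate sample, assuming pandas 1.1+ (for DataFrameGroupBy.sample). It also shows how a continuous variable could be binned with pd.cut before being used as a stratum (the 'salary' column is made up purely for illustration) and how to compare strata shares between the sample and the population:

# Compact alternative (pandas >= 1.1): sample 0.1% of every income/sex/age stratum
stratified_sample2 = (
    population
    .groupby(['income', 'sex', 'age'])
    .sample(frac=0.001, random_state=0)
)

# A continuous variable should be binned before stratifying on it.
# 'salary' is a hypothetical column added only for illustration; its bins
# could then be appended to the list of groupby keys above.
population['salary'] = np.random.normal(50000, 15000, len(population))
population['salary_bin'] = pd.cut(population['salary'], bins=4, labels=False)

# Check representativeness: strata shares in the sample vs. the population
pop_shares = population.groupby(['income', 'sex', 'age']).size() / len(population)
sample_shares = (
    stratified_sample2.groupby(['income', 'sex', 'age']).size()
    / len(stratified_sample2)
)
print(pd.concat([pop_shares, sample_shares], axis=1,
                keys=['population', 'sample']))

The two columns printed at the end should be nearly identical, which is what makes the stratified sample more representative than the plain random one.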
Given that the variables are already binned, the following one-liner should give you the desired output. I realize scikit-learn is mainly aimed at supervised learning rather than your use case, but borrowing a function from it should not do any harm.
Note that if you have a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.
If you test the following method, please share whether it behaves as expected.
from sklearn.model_selection import train_test_split
stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])
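One thing to keep in mind: train_test_split raises an error if any stratum has fewer than two members, so the minimum-observations caveat from the answer above applies here as well. A quick sanity check, assuming the population DataFrame from that answer, could be:

# Every income/sex/age combination must appear at least twice for
# stratify= to work; the smallest stratum size should be >= 2
print(population.groupby(['income', 'sex', 'age']).size().min())

# The kept split holds ~0.1% of the rows, proportionate to the strata
print(len(stratified_sample))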