I am looking for the best way to do random stratified sampling, like in surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit, since I am not doing supervised learning and have no target; I just want to create random stratified samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).
Python is my main language.
Thank you for any help.
We can achieve this with scikit-learn by setting the "stratify" argument of train_test_split() to the variable(s) whose proportions we want to preserve (in supervised learning this is usually the y array). train_test_split() then ensures that both resulting splits contain each class in the same proportion as in the array passed to "stratify".
In general, creating a stratified random sample involves seven steps: (a) defining the population; (b) choosing the relevant stratification variables; (c) listing the population; (d) listing the population according to the chosen stratification; (e) choosing your sample size; (f) calculating a proportionate stratification; and (g) drawing a random sample within each stratum.
The same idea appears in stratified train-test splits: it is desirable to split a dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
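As a minimal sketch of that stratify mechanism (the toy X and y arrays below are made up for the example):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with an imbalanced binary label (10% ones)
X = np.random.rand(1000, 3)
y = np.array([0] * 900 + [1] * 100)

# Passing the label array to `stratify` preserves the 90/10 class ratio
# in both resulting splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())  # both are approximately 0.10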
This is my best solution so far. It is important to bin continuous variables beforehand and to have a minimum number of observations in each stratum (both points are sketched after the code below).
In this example, I am:
- generating a random population of 100,000 observations with three stratification variables (income, sex and age);
- drawing a simple random sample of 100 observations;
- drawing a random stratified sample of 100 observations, proportionate to each income/sex/age stratum.
When comparing both samples, the stratified one is much more representative of the overall population.
If anyone has an idea of a more optimal way to do it, please feel free to share.
import pandas as pd
import numpy as np

# Generate a random population (100k observations)
population = pd.DataFrame(index=range(0, 100000))
population['income'] = 0
population.loc[39000:79999, 'income'] = 1  # .loc avoids chained-assignment issues
population.loc[80000:, 'income'] = 2
population['sex'] = np.random.randint(0, 2, 100000)
population['age'] = np.random.randint(0, 4, 100000)

# Number of observations in each income/sex/age stratum
pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()

# Simple random sampling (100 observations out of 100k,
# drawn with replacement via randint)
random_sample = population.iloc[
    np.random.randint(0, len(population), int(len(population) / 1000))
]

# Random stratified sampling (100 observations out of 100k):
# sample 0.1% of each stratum, then concatenate the pieces
stratified_sample = list(
    map(
        lambda x: population[
            (population['income'] == pop_count.index[x][0])
            & (population['sex'] == pop_count.index[x][1])
            & (population['age'] == pop_count.index[x][2])
        ].sample(frac=0.001),
        range(len(pop_count)),
    )
)
stratified_sample = pd.concat(stratified_sample)
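For what it's worth, here is a sketch of a more compact way to draw the same proportionate sample, assuming pandas 1.1+ (for DataFrameGroupBy.sample). It also shows how a continuous variable could be binned with pd.cut before being used as a stratum (the 'salary' column is made up purely for illustration) and how to compare strata shares between the sample and the population:

# Compact alternative (pandas >= 1.1): sample 0.1% of every income/sex/age stratum
stratified_sample2 = (
    population
    .groupby(['income', 'sex', 'age'])
    .sample(frac=0.001, random_state=0)
)

# A continuous variable should be binned before stratifying on it.
# 'salary' is a hypothetical column added only for illustration; its bins
# could then be appended to the list of groupby keys above.
population['salary'] = np.random.normal(50000, 15000, len(population))
population['salary_bin'] = pd.cut(population['salary'], bins=4, labels=False)

# Check representativeness: strata shares in the sample vs. the population
pop_shares = population.groupby(['income', 'sex', 'age']).size() / len(population)
sample_shares = (
    stratified_sample2.groupby(['income', 'sex', 'age']).size()
    / len(stratified_sample2)
)
print(pd.concat([pop_shares, sample_shares], axis=1,
                keys=['population', 'sample']))

The two columns printed at the end should be nearly identical, which is what makes the stratified sample more representative than the plain random one.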
Given that the variables are already binned, the following one-liner should give you the desired output. I realize scikit-learn is mainly aimed at supervised learning rather than your use case, but borrowing a function from it should not do any harm.
Note that if you have a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.
If you test the following method, please share whether it behaves as expected.
from sklearn.model_selection import train_test_split
stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])
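One thing to keep in mind: train_test_split raises an error if any stratum has fewer than two members, so the minimum-observations caveat from the answer above applies here as well. A quick sanity check, assuming the population DataFrame from that answer, could be:

# Every income/sex/age combination must appear at least twice for
# stratify= to work; the smallest stratum size should be >= 2
print(population.groupby(['income', 'sex', 'age']).size().min())

# The kept split holds ~0.1% of the rows, proportionate to the strata
print(len(stratified_sample))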