
Stratified Sampling in Pandas

I've looked at the sklearn stratified sampling docs, the pandas docs, and the questions "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but none of them address this issue.

I'm looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the entries.

Concrete example:

[image not reproduced]

Thank you! :)

asked May 22 '17 13:05 by Wboy




2 Answers

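Use min when passing the number of rows to sample, so that a group smaller than the requested size is taken whole. Consider the dataframe df:

    import pandas as pd

    df = pd.DataFrame(dict(
        A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
        B=range(10)
    ))

    # Sample 2 rows per group, or the whole group if it has fewer than 2 rows
    df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

       A  B
    1  1  1
    2  1  2
    3  2  3
    6  2  6
    7  3  7
    9  4  9
    8  4  8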
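The same idea can also be written without groupby.apply, which makes it easy to pass a seed for reproducible results. This is only a sketch of one possible variant, not part of the answer above: the helper name sample_up_to_n and its random_state parameter are made up for illustration.

```python
import pandas as pd


def sample_up_to_n(df, col, n, random_state=None):
    """Take at most n random rows from each group of `col` (hypothetical helper)."""
    parts = [
        # min() caps the request at the group size, so small groups are kept whole
        group.sample(min(len(group), n), random_state=random_state)
        for _, group in df.groupby(col)
    ]
    # Original row labels are preserved because concat keeps each part's index
    return pd.concat(parts)


df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
sampled = sample_up_to_n(df, "A", 2, random_state=0)

# Rows per group: two each for A=1, 2 and 4, and the single row for A=3
counts = sampled["A"].value_counts().sort_index()
print(counts)
```

Which rows are picked depends on the seed, but the per-group counts do not: every group contributes min(len(group), n) rows.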
answered Sep 29 '22 11:09 by piRSquared



Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can just take n_samples from each class (as in the previous answer). When the smallest class has fewer than n_samples rows, we instead take as many rows from every class as the smallest class contains.

    def stratified_sample_df(df, col, n_samples):
        # Cap the per-class sample size at the size of the smallest class
        n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n))
        # Drop the group-key level that apply adds to the index
        df_.index = df_.index.droplevel(0)
        return df_
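As a rough illustration of what the balanced variant returns, here is a sketch on the dataframe from the first answer. It assumes pandas 1.1 or later, where DataFrameGroupBy.sample is available, and uses that method in place of apply. The random_state value is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})

n_samples = 2
# The smallest class (A == 3) has a single row, so n is capped at 1
n = min(n_samples, df["A"].value_counts().min())

# GroupBy.sample draws n rows from every group and keeps the original index
balanced = df.groupby("A").sample(n=n, random_state=0)

counts = balanced["A"].value_counts().sort_index()
print(n)       # 1
print(counts)  # one row for each of the four classes
```

Because n never exceeds the size of the smallest class, sampling without replacement cannot fail here. The trade-off is that a single rare class shrinks the sample taken from every other class.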
answered Sep 29 '22 11:09 by Ilya Prokin
