I've looked at the sklearn stratified sampling docs as well as the pandas docs, and also at "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but they do not address this issue.
I'm looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the entries.
Concrete example:
Thank you! :)
A stratified sample is one that ensures that subgroups (strata) of a given population are each adequately represented within the whole sample population of a research study. For example, one might divide a sample of adults into subgroups by age, like 18–29, 30–39, 40–49, 50–59, and 60 and above.
A stratified sample is defined as one resulting from classification of population into mutually exclusive groups, called strata, and choosing a simple random sample from each stratum. The main reason for using stratified sampling instead of simple random sampling is improved efficiency of sampling [2,3].
Stratified random sampling is a type of probability sampling in which a research organization divides the entire population into multiple non-overlapping, homogeneous groups (strata) and randomly chooses final members from each stratum, which reduces cost and improves efficiency.
Use min when passing the number to sample. Consider the dataframe df:
    import pandas as pd

    df = pd.DataFrame(dict(
        A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
        B=range(10)
    ))

    # Sample at most 2 rows per group; groups with fewer rows are taken whole
    df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

       A  B
    1  1  1
    2  1  2
    3  2  3
    6  2  6
    7  3  7
    9  4  9
    8  4  8
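If the apply call turns out to be slow on a large frame, one alternative (a different technique from the one above, not a drop-in copy of it) is to shuffle the whole frame once and then keep the first n rows of each group. This is only a sketch: the name sample_up_to_n and its random_state parameter are my own, added for illustration and reproducibility.

```python
import pandas as pd

def sample_up_to_n(df, col, n, random_state=None):
    """Return up to n randomly chosen rows per value of `col` (hypothetical helper)."""
    # Shuffle all rows once, so "first n rows per group" is a random pick.
    shuffled = df.sample(frac=1, random_state=random_state)
    # head(n) keeps at most n rows per group; smaller groups are kept whole.
    return shuffled.groupby(col, sort=False).head(n)

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
out = sample_up_to_n(df, "A", 2, random_state=0)

# Rows per group in the sample: group 3 has only one row, so it contributes one.
counts = {int(k): int(v) for k, v in out["A"].value_counts().items()}
print(counts)
```

Which rows get picked depends on the seed, but the per-group counts should always be min(group size, n). The result keeps the shuffled row order; call sort_index() on it if you want the original order back.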
Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can just take n_samples from each class (as in the previous answer). When the minority class contains fewer than n_samples rows, we can take the same number of samples from every class as there are rows in the minority class.
    def stratified_sample_df(df, col, n_samples):
        n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n))
        df_.index = df_.index.droplevel(0)
        return df_
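To see what the balanced variant does in practice, here is a small self-contained sketch. It follows the same idea (cap n at the size of the smallest class) but builds the result with an explicit loop over groups and pd.concat instead of apply. The name balanced_sample and the random_state parameter are assumptions of mine, not part of the answer above.

```python
import pandas as pd

def balanced_sample(df, col, n_samples, random_state=None):
    """Sample the same number of rows from every class in `col` (hypothetical helper)."""
    # Never ask for more rows than the smallest class can supply.
    n = min(n_samples, df[col].value_counts().min())
    # Draw n rows from each class and stack the pieces back together.
    parts = [group.sample(n, random_state=random_state) for _, group in df.groupby(col)]
    return pd.concat(parts)

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
balanced = balanced_sample(df, "A", n_samples=2, random_state=0)

# Class 3 has a single row, so every class ends up with exactly one row.
balanced_counts = {int(k): int(v) for k, v in balanced["A"].value_counts().items()}
print(balanced_counts)
```

Note the trade-off compared with the min(len(x), n) approach: the sample is balanced, but a single tiny class shrinks the sample for every other class as well.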