Stratified Sampling in Pandas

I've looked at the sklearn stratified sampling docs and the pandas docs, as well as the questions "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but none of them address this issue.

I'm looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the entries.

Concrete example: (image not available)

Thank you! :)

507

asked May 22 '17 13:05

Wboy

2 Answers

Use min when passing the number of rows to sample, so that a group smaller than the requested size is taken whole. Consider the dataframe df:

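```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

# Sample up to 2 rows per group; groups with fewer than 2 rows are taken whole
df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))
```

One possible output (the sampled rows vary between runs):

```
   A  B
1  1  1
2  1  2
3  2  3
6  2  6
7  3  7
9  4  9
8  4  8
```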
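If the lambda inside apply turns out to be slow on many groups, one alternative worth trying is to shuffle the whole frame once and then keep the first n rows of each group. This is only a sketch and not part of the answer above; the helper name sample_up_to_n and the random_state argument are my own additions for illustration.

```python
import pandas as pd

def sample_up_to_n(frame, col, n, random_state=None):
    # Shuffle all rows once, so the "first n" rows of each group are a random pick
    shuffled = frame.sample(frac=1, random_state=random_state)
    # head(n) keeps at most n rows per group; smaller groups are kept whole
    return shuffled.groupby(col).head(n).sort_index()

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
out = sample_up_to_n(df, "A", 2, random_state=0)
print(out)
```

Passing a fixed random_state makes the draw repeatable; leave it as None to get a different sample on each call.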

162

answered Sep 29 '22 11:09

piRSquared

Extending the groupby answer, we can make sure the sample is balanced. When every class has at least n_samples rows, we simply take n_samples rows from each class (as in the previous answer). When the smallest class has fewer than n_samples rows, we instead take that smaller number of rows from every class, so all classes end up the same size.

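```python
def stratified_sample_df(df, col, n_samples):
    # Cap the per-class sample size at the size of the smallest class
    n = min(n_samples, df[col].value_counts().min())
    # Draw n rows from each class
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    # Drop the group-key index level that groupby/apply adds
    df_.index = df_.index.droplevel(0)
    return df_
```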
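To see the balancing in action, here is a small self-contained sketch of the same idea. It swaps the apply call for GroupBy.sample (available since pandas 1.1), which is my substitution rather than the code from this answer; the name balanced_sample, the random_state argument, and the toy dataframe are assumptions for illustration.

```python
import pandas as pd

def balanced_sample(frame, col, n_samples, random_state=None):
    # Cap the per-class size at the smallest class so every class can supply n rows
    n = int(min(n_samples, frame[col].value_counts().min()))
    # GroupBy.sample draws n rows from each group and keeps the original index
    return frame.groupby(col).sample(n=n, random_state=random_state)

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 2, 3, 4, 4], "B": range(10)})
balanced = balanced_sample(df, "A", n_samples=2, random_state=0)
print(balanced["A"].value_counts().sort_index())
```

In this toy frame class 3 has a single row, so every class contributes exactly one row even though n_samples is 2. Note that this differs from the first answer, which keeps up to 2 rows from the larger classes.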

answered Sep 29 '22 11:09

Ilya Prokin

Tags: python, pandas, numpy, scikit-learn
