I have a dataframe with a category column. The dataframe has a different number of rows for each category.
category number_of_rows
cat1 19189
cat2 13193
cat3 4500
cat4 1914
cat5 568
cat6 473
cat7 216
cat8 206
cat9 197
cat10 147
cat11 130
cat12 49
cat13 38
cat14 35
cat15 35
cat16 30
cat17 29
cat18 9
cat19 4
cat20 4
cat21 1
cat22 1
cat23 1
I want to select a different number of rows from each category (instead of a fixed number of n rows from every category).
Example input:
size_1 : {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
Example input:
size_2 : {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}
What I want to do is actually stratified sampling with a given number of instances for each category.
Also, the rows should be selected randomly. For example, I don't want the top 40 rows for size_1["cat1"], I want 40 random rows.
Thanks for the help.
You can use pandas DataFrame.groupby().count() to group rows and compute a count or size aggregate; this calculates the row count for each group combination.
You can also count with the value_counts() method, which counts how often each value appears in a particular column (or across the whole DataFrame).
You can group DataFrame rows into a list by calling pandas.DataFrame.groupby() on the column of interest, selecting the column you want from each group, and then using Series.apply(list) to get a list for every group.
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
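As a minimal sketch of these counting patterns (using a small made-up DataFrame, not the asker's data), either groupby().size() or value_counts() reproduces a per-category row count like the table in the question:
import pandas as pd

# Hypothetical example data; the real DataFrame has the categories listed in the question
df = pd.DataFrame({'category': ['cat1', 'cat1', 'cat2', 'cat1', 'cat2', 'cat3']})

# Row count per category via groupby
counts_from_groupby = df.groupby('category').size()

# The same counts via value_counts() on the column
counts_from_value_counts = df['category'].value_counts()

print(counts_from_groupby)
print(counts_from_value_counts)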
Let's first generate some data to see how we can solve the problem:
import pandas as pd

# Define a DataFrame containing employee data
df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})

# Number of rows that we want to sample from each category
samples_per_group_dict = {'Jai': 1,
                          'Princi': 2}
I can propose two solutions:
Apply on groupby (one-liner)
output = df.groupby('Category').apply(lambda group: group.sample(samples_per_group_dict[group.name])).reset_index(drop=True)
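A small optional variation (my tweak, not part of the original answer): passing group_keys=False to groupby keeps the original row index instead of prepending the group key, so the apply result carries no ('Category', original index) MultiIndex to clean up:
# Same sampling, but the result keeps a flat index
output = (df.groupby('Category', group_keys=False)
            .apply(lambda group: group.sample(samples_per_group_dict[group.name]))
            .reset_index(drop=True))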
Looping groups (more verbose)
list_of_sampled_groups = []
for name, group in df.groupby('Category'):
    n_rows_to_sample = samples_per_group_dict[name]
    sampled_group = group.sample(n_rows_to_sample)
    list_of_sampled_groups.append(sampled_group)
output = pd.concat(list_of_sampled_groups).reset_index(drop=True)
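One caveat worth flagging (my assumption about your real data, since several categories in the question contain only a handful of rows): DataFrame.sample raises a ValueError when the requested n exceeds the group size, unless you pass replace=True. A minimal sketch of capping the request at the group size:
list_of_sampled_groups = []
for name, group in df.groupby('Category'):
    # Cap the request so that very small groups do not raise ValueError
    n_rows_to_sample = min(samples_per_group_dict[name], len(group))
    list_of_sampled_groups.append(group.sample(n_rows_to_sample))
output = pd.concat(list_of_sampled_groups).reset_index(drop=True)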
Performance should be similar for both approaches. If performance matters, you can vectorize the calculation, but the best optimization depends on the number of groups and the number of samples in each group.
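If you do want to avoid apply and the Python-level loop entirely, one possible sketch (my suggestion, not from the original answer) is to shuffle the rows once and keep the first n rows of each category via cumcount; note that categories missing from the dict are simply dropped here:
# Shuffle all rows once, then keep the first n rows of each category
shuffled = df.sample(frac=1)
n_wanted = shuffled['Category'].map(samples_per_group_dict)  # requested size for each row's category
keep = shuffled.groupby('Category').cumcount() < n_wanted
output = shuffled[keep].reset_index(drop=True)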