I have a dataframe with a category column. The dataframe has a different number of rows for each category.
category number_of_rows
cat1 19189
cat2 13193
cat3 4500
cat4 1914
cat5 568
cat6 473
cat7 216
cat8 206
cat9 197
cat10 147
cat11 130
cat12 49
cat13 38
cat14 35
cat15 35
cat16 30
cat17 29
cat18 9
cat19 4
cat20 4
cat21 1
cat22 1
cat23 1
I want to select a different number of rows from each category (instead of a fixed number of n rows from every category).
Example input:
size_1 : {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
Example input:
size_2 : {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}
What I want to do is actually stratified sampling with a given number of instances for each category.
Also, the rows should be selected randomly. For example, I don't want the top 40 rows for size_1["cat1"], I want 40 random rows.
Thanks for the help.
You can use pandas DataFrame.groupby().count() to group rows and compute a count or size aggregate; this calculates the row count for each group combination.
You can also count with the value_counts() method, which counts how often each value appears in a particular column (or across the whole DataFrame).
You can group DataFrame rows into a list by calling pandas.DataFrame.groupby() on the column of interest, selecting the column you want from each group, and then using Series.apply(list) to get a list for every group.
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
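As a minimal sketch of these counting patterns (using a small made-up DataFrame, not the asker's data), either groupby().size() or value_counts() reproduces a per-category row count like the table in the question:
import pandas as pd

# Hypothetical example data; the real DataFrame has the categories listed in the question
df = pd.DataFrame({'category': ['cat1', 'cat1', 'cat2', 'cat1', 'cat2', 'cat3']})

# Row count per category via groupby
counts_from_groupby = df.groupby('category').size()

# The same counts via value_counts() on the column
counts_from_value_counts = df['category'].value_counts()

print(counts_from_groupby)
print(counts_from_value_counts)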
Let's first generate some data to see how we can solve the problem:
import pandas as pd

# Define a DataFrame containing employee data
df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})

# Number of rows that we want to sample from each category
samples_per_group_dict = {'Jai': 1,
                          'Princi': 2}
I can propose two solutions:
Apply on groupby (one-liner)
output = df.groupby('Category').apply(lambda group: group.sample(samples_per_group_dict[group.name])).reset_index(drop=True)
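A small optional variation (my tweak, not part of the original answer): passing group_keys=False to groupby keeps the original row index instead of prepending the group key, so the apply result carries no ('Category', original index) MultiIndex to clean up:
# Same sampling, but the result keeps a flat index
output = (df.groupby('Category', group_keys=False)
            .apply(lambda group: group.sample(samples_per_group_dict[group.name]))
            .reset_index(drop=True))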
Looping groups (more verbose)
list_of_sampled_groups = []
for name, group in df.groupby('Category'):
    n_rows_to_sample = samples_per_group_dict[name]
    sampled_group = group.sample(n_rows_to_sample)
    list_of_sampled_groups.append(sampled_group)
output = pd.concat(list_of_sampled_groups).reset_index(drop=True)
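One caveat worth flagging (my assumption about your real data, since several categories in the question contain only a handful of rows): DataFrame.sample raises a ValueError when the requested n exceeds the group size, unless you pass replace=True. A minimal sketch of capping the request at the group size:
list_of_sampled_groups = []
for name, group in df.groupby('Category'):
    # Cap the request so that very small groups do not raise ValueError
    n_rows_to_sample = min(samples_per_group_dict[name], len(group))
    list_of_sampled_groups.append(group.sample(n_rows_to_sample))
output = pd.concat(list_of_sampled_groups).reset_index(drop=True)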
Performance should be similar for both approaches. If performance matters, you can vectorize the calculation, but the best optimization depends on the number of groups and the number of samples in each group.
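If you do want to avoid apply and the Python-level loop entirely, one possible sketch (my suggestion, not from the original answer) is to shuffle the rows once and keep the first n rows of each category via cumcount; note that categories missing from the dict are simply dropped here:
# Shuffle all rows once, then keep the first n rows of each category
shuffled = df.sample(frac=1)
n_wanted = shuffled['Category'].map(samples_per_group_dict)  # requested size for each row's category
keep = shuffled.groupby('Category').cumcount() < n_wanted
output = shuffled[keep].reset_index(drop=True)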