 

How to sample different number of rows from each group in DataFrame

I have a DataFrame with a category column. The DataFrame has a different number of rows for each category:

category number_of_rows
cat1     19189
cat2     13193
cat3     4500
cat4     1914
cat5     568
cat6     473
cat7     216
cat8     206
cat9     197
cat10    147
cat11    130
cat12    49
cat13    38
cat14    35
cat15    35
cat16    30
cat17    29
cat18    9
cat19    4
cat20    4
cat21    1
cat22    1
cat23    1

I want to select a different number of rows from each category (instead of a fixed n rows from every category).

Example inputs:

size_1 : {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
size_2 : {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}

What I want to do is essentially stratified sampling, with a given number of instances for each category.

The rows should also be selected randomly. For example, I don't want the top 40 rows for size_1["cat1"], I want 40 random rows.

Thanks for the help.

asked Dec 21 '19 by Stolyassa


People also ask

How do I count the number of rows in each group of a Groupby object?

You can use pandas DataFrame.groupby().count() to group columns and compute the count or size aggregate; this calculates a row count for each group combination.
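A minimal sketch of both per-group counting calls (the data here is illustrative, not from the question):

```python
import pandas as pd

# Illustrative DataFrame with a repeated category column
df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})

# size() counts all rows per group; count() counts non-null values per column
counts = df.groupby("category").size()
print(counts["a"])  # 2
```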

How do I count the number of rows with a specific value in Pandas?

We can count by using the value_counts() method. This method counts the occurrences of each value, and it can be applied to a single column or to the entire DataFrame.
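For example, counting how often each category value occurs in one column (the data is made up for illustration):

```python
import pandas as pd

s = pd.Series(["cat1", "cat1", "cat2"])

# value_counts() returns occurrence counts, sorted in descending order
counts = s.value_counts()
print(counts["cat1"])  # 2
```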

How do I group specific rows in Pandas?

You can group DataFrame rows into a list by using pandas.DataFrame.groupby() on the column of interest, selecting the column you want, and then applying Series.apply(list) (or the equivalent .apply(list) on the grouped column) to get a list for every group.
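A short sketch of grouping one column's values into per-group lists (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})

# For each group, collect the 'value' column into a Python list
lists = df.groupby("category")["value"].apply(list)
print(lists["a"])  # [1, 2]
```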

How do I split a DataFrame into a group?

Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
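The three split-apply-combine steps above can be sketched as follows (the mean aggregation and the data are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"], "value": [10, 20, 30]})

grouped = df.groupby("category")  # step 1: split into groups
means = grouped["value"].mean()   # step 2: apply an aggregation function
result = means.reset_index()      # step 3: combine results into a new DataFrame
print(result)
```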


1 Answer

Artificial data generation


DataFrame

Let's first generate some data to see how we can solve the problem:

import pandas as pd

# Define a small DataFrame with a 'Category' column to group on
df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})

Sampling rule

# Number of rows that we want to sample from each category
samples_per_group_dict = {'Jai': 1,
                          'Princi': 2}


Problem solving


I can propose two solutions:

  1. Apply on groupby (one-liner)

    output = df.groupby('Category').apply(lambda group: group.sample(samples_per_group_dict[group.name])).reset_index(drop=True)
    
  2. Looping groups (more verbose)

    list_of_sampled_groups = []
    
    for name, group in df.groupby('Category'):    
        n_rows_to_sample = samples_per_group_dict[name]
        sampled_group = group.sample(n_rows_to_sample)
        list_of_sampled_groups.append(sampled_group)
    
    output = pd.concat(list_of_sampled_groups).reset_index(drop=True)
    

Performance should be comparable for both approaches. If performance matters, you can vectorize your calculation, but the best optimization depends on the number of groups and the number of samples per group.
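One caveat worth noting: group.sample(n) raises a ValueError if n exceeds the group's row count (e.g. cat21 in the question has only 1 row). A hedged sketch that caps each request at the group size; the min() capping policy is an assumption on my part, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15]})

# 'Princi' requests 5 rows but the group only has 2; cap at the group size
samples_per_group_dict = {'Jai': 2, 'Princi': 5}

output = (df.groupby('Category', group_keys=False)
            .apply(lambda g: g.sample(min(samples_per_group_dict[g.name], len(g))))
            .reset_index(drop=True))
print(output['Category'].value_counts())
```

Alternatively, group.sample(n, replace=True) would oversample small groups instead of capping, if duplicated rows are acceptable.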

answered Oct 02 '22 by Stas Buzuluk