I need to select half of a dataframe using groupby, where the size of each group is unknown and may vary across groups. For example:
index summary participant_id
0 130599 17.0 13
1 130601 18.0 13
2 130603 16.0 13
3 130605 15.0 13
4 130607 15.0 13
5 130609 16.0 13
6 130611 17.0 13
7 130613 15.0 13
8 130615 17.0 13
9 130617 17.0 13
10 86789 12.0 14
11 86791 8.0 14
12 86793 21.0 14
13 86795 19.0 14
14 86797 20.0 14
15 86799 9.0 14
16 86801 10.0 14
20 107370 1.0 15
21 107372 2.0 15
22 107374 2.0 15
23 107376 4.0 15
24 107378 4.0 15
25 107380 7.0 15
26 107382 6.0 15
27 107597 NaN 15
28 107384 14.0 15
The sizes of the groups from groupby('participant_id') are 10, 7, and 9 for participant_id 13, 14, and 15 respectively. What I need is to take only the FIRST half (or floor(N/2)) of each group.
From my (very limited) experience with Pandas groupby, it should be something like:
df.groupby('participant_id')[['summary','participant_id']].apply(lambda x: x[:k_i])
where k_i is half the size of each group. Is there a simple way to find k_i?
You can calculate each value's percentage of its group total with groupby() and the DataFrame transform() method. transform() applies a function to each group and returns a result aligned with the rows of the original DataFrame. If you instead compute the percentage directly on the un-grouped DataFrame, the result is calculated using all the data.
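A minimal sketch of the percentage-of-group-total idea, on made-up data (the values below are illustrative and not taken from the question):

```python
import pandas as pd

# Hypothetical toy data, chosen so the percentages are easy to check by hand.
df = pd.DataFrame({
    "participant_id": [13, 13, 14, 14],
    "summary": [10.0, 30.0, 5.0, 15.0],
})

# transform("sum") repeats each group's total on every row of that group,
# so the division below lines up row by row with the original DataFrame.
group_total = df.groupby("participant_id")["summary"].transform("sum")
df["pct_of_group"] = df["summary"] / group_total * 100
```

Because transform() keeps the original index, the new column can be assigned straight back onto df.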
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
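The three steps can be sketched roughly like this, assuming a small made-up DataFrame and the mean as the summary statistic (both are illustrative choices):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "participant_id": [13, 13, 14, 14, 14],
    "summary": [17.0, 15.0, 12.0, 8.0, 10.0],
})

# Step 1: split the rows into groups (nothing is computed yet).
groups = df.groupby("participant_id")

# Steps 2 and 3: apply an aggregation to each group and combine the
# per-group results into a new Series indexed by participant_id.
mean_summary = groups["summary"].mean()
```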
IIUC, you can slice each group with iloc up to size // 2 inside a lambda:
df.groupby('participant_id').apply(lambda x: x.iloc[:x.participant_id.size//2])
Output:
index summary participant_id
participant_id
13 0 130599 17.0 13
1 130601 18.0 13
2 130603 16.0 13
3 130605 15.0 13
4 130607 15.0 13
14 10 86789 12.0 14
11 86791 8.0 14
12 86793 21.0 14
15 20 107370 1.0 15
21 107372 2.0 15
22 107374 2.0 15
23 107376 4.0 15
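If you would rather keep the original flat index instead of the extra participant_id level in the output above, one option is to pass group_keys=False and use the resulting index to select from df. This is only a sketch on a shortened, made-up version of the data; selecting just the summary column before apply is my own choice here, made so the example does not depend on how a given pandas version treats the grouping column inside apply.

```python
import pandas as pd

# Shortened, hypothetical version of the question's data: groups of size 4 and 3.
df = pd.DataFrame({
    "summary": [17.0, 18.0, 16.0, 15.0, 12.0, 8.0, 21.0],
    "participant_id": [13, 13, 13, 13, 14, 14, 14],
})

# group_keys=False keeps the original row labels instead of adding
# participant_id as an outer index level.
half_index = (
    df.groupby("participant_id", group_keys=False)[["summary"]]
      .apply(lambda g: g.iloc[: len(g) // 2])  # first floor(N/2) rows of each group
      .index
)

# Use the surviving row labels to pull the full rows from the original frame.
first_half = df.loc[half_index]
```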
You could group by participant_id and check whether each row's position falls in the first half of its group with the transform method. This creates a boolean Series, which you can then use to filter your original dataframe.
import numpy as np
criteria = df.groupby('participant_id')['participant_id']\
    .transform(lambda x: np.arange(len(x)) < len(x) // 2)
df[criteria]
index summary participant_id
0 130599 17.0 13
1 130601 18.0 13
2 130603 16.0 13
3 130605 15.0 13
4 130607 15.0 13
10 86789 12.0 14
11 86791 8.0 14
12 86793 21.0 14
20 107370 1.0 15
21 107372 2.0 15
22 107374 2.0 15
23 107376 4.0 15
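The same boolean-mask idea can also be written without a lambda, using cumcount() for the position within each group and transform('size') for the group length. A sketch on a shortened, made-up version of the data:

```python
import pandas as pd

# Shortened, hypothetical version of the question's data: groups of size 4 and 3.
df = pd.DataFrame({
    "summary": [17.0, 18.0, 16.0, 15.0, 12.0, 8.0, 21.0],
    "participant_id": [13, 13, 13, 13, 14, 14, 14],
})

grouped = df.groupby("participant_id")

# Position of each row inside its group: 0, 1, 2, ...
position = grouped.cumcount()

# Size of the row's group, repeated on every row of that group.
group_size = grouped["participant_id"].transform("size")

# Keep rows whose position is below floor(N/2) for their group.
first_half = df[position < group_size // 2]
```

Both helpers return Series aligned with df, so the comparison yields a mask that can be used directly for filtering.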