
Keep X% last rows by group in Pandas

It's straightforward to keep the last N rows for every group in a dataframe with something like df.groupby('ID').tail(N).

In my case, groups have different sizes and I would like to keep the same % of each group rather than same number of rows.

e.g if we want to keep the last 50% rows for each group (based on ID) for the following :

df = pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'],
                   'value' : [1,2,10,11,12,13,14,15]})

The result would be :

pd.DataFrame({'ID' : ['A','B','B','B'],
              'value' : [2,13,14,15]})

How can we get to that?

EDIT: If x% of the group size is not an int, we round down to the nearest int.
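That rounding rule matches plain int() truncation, which floors positive values (a quick sketch, not from the original post):

```python
ratio = 0.5
for size in [2, 5, 6]:
    # int() truncates toward zero, which floors positive values
    print(size, int(ratio * size))
# 2 1
# 5 2
# 6 3
```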

mlx asked Mar 10 '21


2 Answers

groupby-apply-tail

Pass the desired size to tail() in a GroupBy.apply(). This is simpler than the iloc method below since it cleanly handles the "last 0 rows" case.

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15
ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

groupby-apply-iloc

Alternatively, slice off the desired size with negative indexing, but this is clunkier: [-0:] returns the whole frame rather than the last 0 rows, so we have to guard against that case explicitly:

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15
ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15
tdy answered Sep 27 '22


As noted in the comments, there is no built-in option for this. You can do something like:

groups = df.groupby('ID')

enums = groups.cumcount().add(1)
sizes = groups['ID'].transform('size')

df[enums/sizes > 0.5]

Output:

  ID  value
1  A      2
5  B     13
6  B     14
7  B     15
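The > 0.5 mask keeps the last half of the even-sized groups here, but on an odd-sized group the fraction comparison rounds up (e.g. with size 5 it keeps 3 rows). A sketch generalizing to an arbitrary ratio while rounding down per the question's edit, by comparing the within-group rank to the floored cutoff (the generalization is mine, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'value': [1, 2, 10, 11, 12, 13, 14, 15]})

ratio = 0.5
groups = df.groupby('ID')
enums = groups.cumcount().add(1)              # 1-based rank within each group
sizes = groups['ID'].transform('size')        # group size broadcast to each row

# keep rows ranked after size - floor(ratio * size) within their group
result = df[enums > sizes - (ratio * sizes).astype(int)]
print(result)
#   ID  value
# 1  A      2
# 5  B     13
# 6  B     14
# 7  B     15
```

With ratio = 0.5 this reproduces the output above; on a 5-row group it would keep floor(2.5) = 2 rows, whereas the enums/sizes > 0.5 mask would keep 3.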
Quang Hoang answered Sep 27 '22