It's straightforward to keep the last N rows for every group in a dataframe with something like df.groupby('ID').tail(N).
In my case, groups have different sizes, and I would like to keep the same percentage of each group rather than the same number of rows.
For example, if we want to keep the last 50% of rows for each group (based on ID) for the following:
df = pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'],
'value' : [1,2,10,11,12,13,14,15]})
The result would be :
pd.DataFrame({'ID' : ['A','B','B','B'],
'value' : [2,13,14,15]})
How can we get to that?
EDIT: If x% of a group's size is not an integer, we round down to the nearest integer.
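For reference on that rounding rule: Python's built-in int() truncates toward zero, which for positive values is exactly the round-down (floor) behavior described, so int(ratio * len(group)) gives the number of rows to keep:

```python
ratio = 0.5
for n in (2, 5, 6):
    # int() on a positive float floors it, e.g. int(2.5) == 2
    print(n, int(ratio * n))
# prints:
# 2 1
# 5 2
# 6 3
```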
groupby + apply + tail
Pass the desired size to tail() in a GroupBy.apply(). This is simpler than the iloc method below since it cleanly handles the "last 0 rows" case.
ratio = 0.6
(df.groupby('ID')
.apply(lambda x: x.tail(int(ratio * len(x))))
.reset_index(drop=True))
# ID value
# 0 A 2
# 1 B 13
# 2 B 14
# 3 B 15
ratio = 0.4
(df.groupby('ID')
.apply(lambda x: x.tail(int(ratio * len(x))))
.reset_index(drop=True))
# ID value
# 0 B 14
# 1 B 15
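A variant sketch, assuming a reasonably recent pandas version: passing group_keys=False to groupby keeps the original index on the result, so no reset_index is needed and each kept row stays identifiable by its original label:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'value': [1, 2, 10, 11, 12, 13, 14, 15]})

ratio = 0.6
# group_keys=False leaves the original index intact on the output
out = df.groupby('ID', group_keys=False).apply(lambda x: x.tail(int(ratio * len(x))))
print(out)
#   ID  value
# 1  A      2
# 5  B     13
# 6  B     14
# 7  B     15
```

Keeping the original index is useful when the result is fed back into the original dataframe, e.g. via df.loc[out.index].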
groupby + apply + iloc
Alternatively, index the desired size via iloc/slicing, but this is clunkier since [-0:] does not actually get the last 0 rows, so we have to guard against that:
ratio = 0.6
(df.groupby('ID')
.apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
.reset_index(drop=True))
# ID value
# 0 A 2
# 1 B 13
# 2 B 14
# 3 B 15
ratio = 0.4
(df.groupby('ID')
.apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
.reset_index(drop=True))
# ID value
# 0 B 14
# 1 B 15
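A positive-offset variant of the slicing idea (my own sketch, not part of the original answer): computing the start position as len(x) - k sidesteps the [-0:] pitfall entirely, because x.iloc[len(x):] is simply an empty slice:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'value': [1, 2, 10, 11, 12, 13, 14, 15]})

ratio = 0.4
# start at len(x) - k; when k == 0 this is len(x), yielding an empty slice
out = (df.groupby('ID')
         .apply(lambda x: x.iloc[len(x) - int(ratio * len(x)):])
         .reset_index(drop=True))
print(out)
#   ID  value
# 0  B     14
# 1  B     15
```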
As noted in the comments, there is no built-in option for this. You can do something like:
groups = df.groupby('ID')
enums = groups.cumcount().add(1)
sizes = groups['ID'].transform('size')
df[enums/sizes > 0.5]
Output:
ID value
1 A 2
5 B 13
6 B 14
7 B 15
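To generalize this vectorized approach to an arbitrary ratio while matching the round-down rule, one sketch (variable names are my own) keeps the last floor(ratio * size) rows of each group:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'value': [1, 2, 10, 11, 12, 13, 14, 15]})

ratio = 0.5
groups = df.groupby('ID')
enums = groups.cumcount().add(1)        # 1-based position within each group
sizes = groups['ID'].transform('size')  # group size, broadcast to each row
keep = (ratio * sizes).astype(int)      # floor(ratio * size), since sizes > 0
out = df[enums > sizes - keep]
print(out)
#   ID  value
# 1  A      2
# 5  B     13
# 6  B     14
# 7  B     15
```

Note the edge-case difference from the enums/sizes > 0.5 comparison above: for a group of 5 rows with ratio 0.5, that comparison keeps 3 rows, while this version keeps floor(2.5) = 2, matching the rounding rule in the question's EDIT.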