import pandas as pd
df = {'a': ['xxx', 'xxx','xxx','yyy','yyy','yyy'], 'start': [10000, 10500, 11000, 12000, 13000, 14000] }
df = pd.DataFrame(data=df)
df_new = df.groupby("a", as_index=True).agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
)
This gives:
>>> df_new
     ProcessiveGroupLength  StartMin  StartMax
a
xxx                      3     10000     11000
yyy                      3     12000     14000
How can I get the result below on the fly? I think computing Diff during the aggregation will be faster.
>>> df_new
     ProcessiveGroupLength  Diff
a
xxx                      3  1000
yyy                      3  2000
The code below gives the following error message:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'
df_new = df.groupby("a").agg(
ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
Diff=pd.NamedAgg(column='start', aggfunc="max"-"min"),)
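The TypeError has nothing to do with pandas: the expression "max" - "min" is evaluated by Python before the aggregation spec is ever built, and subtracting one string from another is not defined. A minimal demonstration:

>>> "max" - "min"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'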
Your solution can be changed to use a lambda function, but with many groups and/or a large DataFrame this should be slower than the first solution. The reason is that "min" and "max" are optimized aggregation functions, and the subtraction of the resulting Series is vectorized. In other words, aggregations are faster when lambda functions are avoided.
df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
)
Or you can use numpy.ptp (peak-to-peak, i.e. maximum minus minimum):
import numpy as np

df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),
)
print(df_new)
     ProcessiveGroupLength  Diff
a
xxx                      3  1000
yyy                      3  2000
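A faster alternative, and the one timed first in the benchmarks below, is to keep the optimized "min"/"max" aggregations and subtract the two resulting columns afterwards; DataFrame.pop removes the helper columns so that only Diff remains:

df_new = df.groupby("a", as_index=True).agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))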
Performance: it depends on the data; here 1,000 groups across 1M rows are used:
np.random.seed(20)
N = 1000000
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
                   'start': np.random.randint(10000, size=N)})
print(df)
In [229]: %%timeit
     ...: df_new = df.groupby("a", as_index=True).agg(
     ...:     ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:     StartMin=pd.NamedAgg(column='start', aggfunc="min"),
     ...:     StartMax=pd.NamedAgg(column='start', aggfunc="max"),
     ...: ).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))
     ...:
69 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [230]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:     ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:     Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
     ...: )
     ...:
172 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [231]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:     ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:     Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),
     ...: )
     ...:
171 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
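So the optimized min/max aggregations plus vectorized subtraction are roughly 2.5x faster here than either lambda-based variant. As a sanity check (a sketch, assuming the benchmark df above is still in scope), you can confirm that the fast and slow variants produce the same result:

from pandas.testing import assert_frame_equal

fast = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))

slow = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
)

# check_dtype=False in case the two paths yield different integer widths
assert_frame_equal(fast, slow, check_dtype=False)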