import pandas as pd
df = {'a': ['xxx', 'xxx','xxx','yyy','yyy','yyy'], 'start': [10000, 10500, 11000, 12000, 13000, 14000] }
df = pd.DataFrame(data=df)
df_new = df.groupby("a", as_index=True).agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
)
This gives:
>>> df_new
     ProcessiveGroupLength  StartMin  StartMax
a
xxx                      3     10000     11000
yyy                      3     12000     14000
How can I get the result below on the fly? I think computing Diff during the aggregation will be faster.
>>> df_new
     ProcessiveGroupLength  Diff
a
xxx                      3  1000
yyy                      3  2000
The code below gives the following error message:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'
df_new = df.groupby("a").agg(
ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
Diff=pd.NamedAgg(column='start', aggfunc="max"-"min"),)
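The TypeError has nothing to do with pandas: the expression "max" - "min" is evaluated by Python before the aggregation spec is ever built, and subtracting one string from another is not defined. A minimal demonstration:

>>> "max" - "min"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'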
Your solution can be changed to use a lambda function, but with many groups and/or a large DataFrame this should be slower than the first solution. The reason is that "min" and "max" are optimized aggregation functions, and the subtraction of the resulting Series is vectorized. In other words, aggregations are faster when lambda functions are avoided.
df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
)
Or you can use numpy.ptp (peak-to-peak, i.e. maximum minus minimum):
import numpy as np

df_new = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),
)
print(df_new)
     ProcessiveGroupLength  Diff
a
xxx                      3  1000
yyy                      3  2000
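A faster alternative, and the one timed first in the benchmarks below, is to keep the optimized "min"/"max" aggregations and subtract the two resulting columns afterwards; DataFrame.pop removes the helper columns so that only Diff remains:

df_new = df.groupby("a", as_index=True).agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))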
Performance: it depends on the data; here 1,000 groups across 1M rows are used:
np.random.seed(20)
N = 1000000
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
                   'start': np.random.randint(10000, size=N)})
print(df)
In [229]: %%timeit
     ...: df_new = df.groupby("a", as_index=True).agg(
     ...:     ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:     StartMin=pd.NamedAgg(column='start', aggfunc="min"),
     ...:     StartMax=pd.NamedAgg(column='start', aggfunc="max"),
     ...: ).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))
     ...:
69 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [230]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:     ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:     Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
     ...: )
     ...:
172 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [231]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:     ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:     Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),
     ...: )
     ...:
171 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
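So the optimized min/max aggregations plus vectorized subtraction are roughly 2.5x faster here than either lambda-based variant. As a sanity check (a sketch, assuming the benchmark df above is still in scope), you can confirm that the fast and slow variants produce the same result:

from pandas.testing import assert_frame_equal

fast = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    StartMin=pd.NamedAgg(column='start', aggfunc="min"),
    StartMax=pd.NamedAgg(column='start', aggfunc="max"),
).assign(Diff=lambda x: x.pop('StartMax') - x.pop('StartMin'))

slow = df.groupby("a").agg(
    ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
    Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),
)

# check_dtype=False in case the two paths yield different integer widths
assert_frame_equal(fast, slow, check_dtype=False)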