Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas groupby is duplicating groups when using apply twice

Can pandas groupby use groupby.apply(func) and inside the func use another instance of .apply() without duplicating and overwriting data?

In a way, the use of .apply() is nested.

Python 3.7.3 pandas==0.25.1

import pandas as pd


def dummy_func_nested(row):
    row['new_col_2'] = row['value'] * -1
    return row


def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])

    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)

    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(df)


if __name__ == '__main__':
    pandas_groupby()

The above code should return

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  b      None      -95.0
2      CA   56.0  y      None      -56.0
3      CA   40.0  z      None      -40.0

However, the code returns

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  a      None      -95.0
2      US   56.0  a      None      -56.0
3      US   40.0  a      None      -40.0

This behavior only appears to happen when both groups have an equal amount of rows. If one group has more rows, then the output is as expected.

like image 689
Oleh Dubno Avatar asked Oct 18 '25 20:10

Oleh Dubno


1 Answers

A quote from the documentation:

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.

Try changing the below code in your code:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group

To:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = dummy_func_nested(df_group)

    return df_group

You don't need the apply.

Of course, the more efficient way would be:

df['new_col_1'] = None
df['new_col_2'] = -df['value']
print(df)

Or:

print(df.assign(new_col_1=None, new_col_2=-df['value']))
like image 160
U12-Forward Avatar answered Oct 20 '25 10:10

U12-Forward



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!