Pandas groupby is duplicating groups when using apply twice

Question

Can a function passed to groupby.apply(func) call .apply() itself without duplicating and overwriting data?

In effect, the calls to .apply() are nested.

Python 3.7.3, pandas==0.25.1

import pandas as pd


def dummy_func_nested(row):
    row['new_col_2'] = row['value'] * -1
    return row


def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])

    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)

    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(new_df)


if __name__ == '__main__':
    pandas_groupby()

The above code should return

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  b      None      -95.0
2      CA   56.0  y      None      -56.0
3      CA   40.0  z      None      -40.0

However, the code returns

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  a      None      -95.0
2      US   56.0  a      None      -56.0
3      US   40.0  a      None      -40.0

This behavior only appears to happen when both groups have an equal number of rows. If one group has more rows than the other, the output is as expected.

U12-Forward · Accepted Answer

A quote from the pandas documentation for DataFrame.apply:

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
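To see how often pandas actually invokes a function, a minimal sketch can record every call. The names `calls` and `record_and_negate` are hypothetical, and the number of recorded calls depends on the pandas version, so no particular count is claimed here.

```python
import pandas as pd

# Hypothetical log of every invocation, filled in as a side effect.
calls = []


def record_and_negate(row):
    # Side effect: remember which row this call received.
    calls.append(row['id'])
    return row['value'] * -1


df = pd.DataFrame({'value': [100.0, 95.0], 'id': ['a', 'b']})

# Row-wise apply; each row arrives as a Series.
negated = df.apply(record_and_negate, axis=1)

# If the first row is evaluated twice, 'a' shows up twice in this list.
print(calls)
print(negated.tolist())
```

If `calls` is longer than the number of rows, the function ran more than once for some row, which is exactly the situation where side effects become visible.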

Try changing this part of your code:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group

To:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = dummy_func_nested(df_group)

    return df_group

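You don't need the nested apply: dummy_func_nested only uses column-wise operations, so it works on a whole group DataFrame just as it does on a single row.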
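If the per-row function genuinely has to run row by row, one defensive sketch is to work on copies instead of mutating the objects pandas passes in, and to loop over the groups explicitly. Note that this swaps groupby.apply for a plain loop plus pd.concat; the `_safe` function names are hypothetical, and the sketch has not been checked against pandas 0.25.1.

```python
import pandas as pd


def dummy_func_nested_safe(row):
    # Copy first so the Series handed in by pandas is never mutated.
    row = row.copy()
    row['new_col_2'] = row['value'] * -1
    return row


def dummy_func_safe(df_group):
    # Copy the group for the same reason before adding a column.
    df_group = df_group.copy()
    df_group['new_col_1'] = None
    return df_group.apply(dummy_func_nested_safe, axis=1)


df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# Iterating over a groupby yields (key, sub-DataFrame) pairs.
pieces = [dummy_func_safe(group) for _, group in df.groupby('country')]

# Concatenate the pieces and restore the original row order.
new_df = pd.concat(pieces).sort_index()

assert new_df['country'].tolist() == df['country'].tolist()
print(new_df)
```

The explicit loop is slower than a vectorized expression, but it makes it obvious which object each function receives and returns.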

Of course, a more efficient way is to skip groupby and apply entirely:

df['new_col_1'] = None
df['new_col_2'] = -df['value']
print(df)

Or:

print(df.assign(new_col_1=None, new_col_2=-df['value']))
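For reference, the vectorized version can be run end to end on the question's data. The `group_mean` column is a hypothetical extra, included only to illustrate that values which do depend on the group can come from groupby(...).transform rather than from a nested apply.

```python
import pandas as pd

df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# assign returns a new DataFrame and leaves df untouched.
result = df.assign(new_col_1=None, new_col_2=-df['value'])

# Hypothetical group-dependent column: transform returns one value per
# original row, aligned on the original index.
result['group_mean'] = df.groupby('country')['value'].transform('mean')

# Same check as in the question: the country order is preserved.
assert result['country'].tolist() == df['country'].tolist()
print(result)
```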

Tags: python, pandas, duplicates, pandas-groupby, apply

Oleh Dubno

1 Answers

U12-Forward
