Pandas groupby apply function that combines some groups but not others

I'm using pandas groupby on my DataFrame df, which has columns type, subtype, and 11 others. I then call apply on the groups with my combine_function (it needs a better name):

    grouped = df.groupby('type')
    reduced = grouped.apply(combine_function)

where combine_function checks whether the group contains any row with a given subtype, say 1, and looks like:

    def combine_function(group):
        if 1 in group.subtype:
            return aggregate_function(group)
        else:
            return group

The combine_function can then call aggregate_function, which calculates summary statistics, stores them in the first row, and then reduces the group to that row. It looks like:

    def aggregate_function(group):
        first = group.first_valid_index()
        group.value1[group.index == first] = group.value1.mean()
        group.value2[group.index == first] = group.value2.max()
        group.value3[group.index == first] = group.value3.std()

        group = group[(group.index == first)]
        return group

I'm fairly sure this isn't the best way to do this, but it has been giving me the desired results 99.9% of the time on thousands of DataFrames. However, it sometimes throws an error that seems to be related to a group that I don't want to aggregate having exactly 2 rows:

    ValueError: Shape of passed values is (13,), indices imply (13, 5)

where, in one example, the groups had these sizes:

    In [4]: grouped.size()
    Out[4]:
    type
    1         9288
    3         7667
    5         7604
    11           2
    dtype: int64

It processed the first three groups fine, and then raised the error when it tried to combine everything. If I comment out the line group = group[(group.index == first)], so that it updates but doesn't aggregate, or if I call aggregate_function on all groups, it's fine.

Does anyone know the proper way to do this kind of aggregation of some groups but not others?

asked Nov 23 '25 13:11 by TristanMatthews


1 Answer

Your aggregate_function looks contorted to me. When you aggregate a group, it automatically reduces to one row; you don't need to do it manually. Maybe I am missing the point. (Are you doing something special with the index that I'm not understanding?) But a more typical usage would look like this:

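    import pandas as pd

    # True if any row in the group has subtype 1
    agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
    agg_functions = {'value1': 'mean', 'value2': 'max', 'value3': 'std'}

    # Aggregate the groups that meet the condition; leave the others untouched
    df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
    df2 = df.groupby('type').filter(lambda x: not agg_condition(x))

    result = pd.concat([df1, df2])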
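
To see the split, aggregate, and concat pattern end to end, here is a minimal sketch on a made-up DataFrame. The data, the helper name has_subtype_1, and the use of as_index=False and ignore_index=True are assumptions for illustration, not something taken from the question.

```python
import pandas as pd

# Toy data (hypothetical): type 1 contains subtype 1, type 11 does not.
df = pd.DataFrame({
    "type":    [1, 1, 1, 11, 11],
    "subtype": [1, 2, 2, 3, 3],
    "value1":  [1.0, 2.0, 3.0, 10.0, 20.0],
    "value2":  [5.0, 7.0, 6.0, 1.0, 2.0],
    "value3":  [2.0, 4.0, 6.0, 0.5, 1.5],
})

def has_subtype_1(group):
    # True if any row in the group has subtype == 1 (tests values, not the index).
    return bool((group["subtype"] == 1).any())

# Groups that qualify are collapsed to one summary row each.
to_aggregate = df.groupby("type").filter(has_subtype_1)
summary = (
    to_aggregate.groupby("type", as_index=False)
    .agg({"value1": "mean", "value2": "max", "value3": "std"})
)

# All other groups keep their original rows.
untouched = df.groupby("type").filter(lambda g: not has_subtype_1(g))

result = pd.concat([summary, untouched], ignore_index=True)
print(result)
```

The summary rows only carry type and the three aggregated columns, so any other column (here subtype) is NaN for them after the concat. If those columns matter, they would need their own entry in the aggregation dict.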

Note: agg_condition is messy because (1) Python's built-in in operator checks the index of a Series, not its values, and (2) the result has to be reduced to a scalar with any().
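
The index-versus-values point is easy to check on a small Series. This is a sketch with made-up data, chosen so that the value 1 is present but the index label 1 is not:

```python
import pandas as pd

# Hypothetical Series: the values contain 1, but the index labels do not.
s = pd.Series([1, 2, 3], index=[10, 11, 12])

in_index = 1 in s              # `in` tests the index labels -> False
in_values = 1 in s.values      # test the underlying array -> True
via_isin = s.isin([1]).any()   # element-wise test reduced to a scalar -> True
via_eq = (s == 1).any()        # equivalent comparison-based form -> True

print(in_index, in_values, via_isin, via_eq)
```

With a default integer index, a check like 1 in group.subtype is true whenever the group happens to contain the row labelled 1, regardless of the subtype values, so it can appear to work while testing the wrong thing.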

answered Nov 28 '25 15:11 by Dan Allan


