I'm using pandas groupby on my DataFrame df which has columns type, subtype, and 11 others. I'm then calling an apply with my combine_function (needs a better name) on the groups like:
grouped = df.groupby('type')
reduced = grouped.apply(combine_function)
where my combine_function checks if any element in the group contains any element with the given subtype, say 1, and looks like:
def combine_function(group):
    if 1 in group.subtype:
        return aggregate_function(group)
    else:
        return group
The combine_function can then call an aggregate_function, which calculates summary statistics, stores them in the first row, and then sets that row to be the group. It looks like:
def aggregate_function(group):
    first = group.first_valid_index()
    group.value1[group.index == first] = group.value1.mean()
    group.value2[group.index == first] = group.value2.max()
    group.value3[group.index == first] = group.value3.std()
    group = group[(group.index == first)]
    return group
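For reference, here is a self-contained version of this pattern on a toy DataFrame (the data is made up; only the column names come from the question). It swaps the chained indexing for `.loc`, which avoids chained-assignment warnings, and checks the subtype *values* rather than the index:

```python
import pandas as pd

# Toy data (made up); column names match the question.
df = pd.DataFrame({
    'type':    [1, 1, 1, 2, 2],
    'subtype': [1, 2, 3, 4, 5],
    'value1':  [10.0, 20.0, 30.0, 40.0, 50.0],
    'value2':  [1.0, 2.0, 3.0, 4.0, 5.0],
    'value3':  [5.0, 6.0, 7.0, 8.0, 9.0],
})

def aggregate_function(group):
    group = group.copy()                      # don't mutate the original frame
    first = group.first_valid_index()
    # .loc avoids the chained-assignment pitfall of group.value1[...] = ...
    group.loc[first, 'value1'] = group['value1'].mean()
    group.loc[first, 'value2'] = group['value2'].max()
    group.loc[first, 'value3'] = group['value3'].std()
    return group.loc[[first]]                 # keep only the summary row

def combine_function(group):
    # Check the values of subtype; `1 in group.subtype` would check the index.
    if group['subtype'].isin([1]).any():
        return aggregate_function(group)
    return group

reduced = df.groupby('type', group_keys=False).apply(combine_function)
```

Here the type-1 group (which contains subtype 1) collapses to a single summary row, while the type-2 group passes through unchanged.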
I'm fairly sure this isn't the best way to do this, but it has been giving me the desired results 99.9% of the time on thousands of DataFrames. However, it sometimes throws an error that seems related to a group that I don't want to aggregate having exactly 2 rows:
ValueError: Shape of passed values is (13,), indices imply (13, 5)
where, for example, the group sizes were:
In [4]: grouped.size()
Out[4]:
type
1 9288
3 7667
5 7604
11 2
dtype: int64
It processed the first three groups fine, and then gave the error when it tried to combine everything. If I comment out the line group = group[(group.index == first)] (so that I update but don't aggregate), or if I call my aggregate_function on all groups, it's fine.
Does anyone know the proper way to be doing this kind of aggregation of some groups but not others?
Your aggregate_function looks contorted to me. When you aggregate a group, it automatically reduces to one row; you don't need to do it manually. Maybe I am missing the point. (Are you doing something special with the index that I'm not understanding?) But a more normal usage would look like this:
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': np.mean, 'value2': np.max, 'value3': np.std}
df1 = df.groupby('type').filter(agg_condition).groupby('type').agg(agg_functions)
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))
result = pd.concat([df1, df2])
Note: agg_condition is messy because (1) built-in Python in refers to the index of a Series, not its values, and (2) the result has to be reduced to a scalar by any().
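As a self-contained sketch of this approach (toy data made up; string aggregator names used in place of the numpy functions, and `as_index=False` plus `ignore_index=True` added so the aggregated and pass-through pieces concatenate with aligned columns):

```python
import pandas as pd

# Toy data (made up); column names match the question.
df = pd.DataFrame({
    'type':    [1, 1, 1, 2, 2],
    'subtype': [1, 2, 3, 4, 5],
    'value1':  [10.0, 20.0, 30.0, 40.0, 50.0],
    'value2':  [1.0, 2.0, 3.0, 4.0, 5.0],
    'value3':  [5.0, 6.0, 7.0, 8.0, 9.0],
})

# True for groups whose subtype values contain 1.
agg_condition = lambda x: pd.Series([1]).isin(x['subtype']).any()
agg_functions = {'value1': 'mean', 'value2': 'max', 'value3': 'std'}

# Groups containing subtype 1 reduce to one summary row each...
df1 = (df.groupby('type').filter(agg_condition)
         .groupby('type', as_index=False).agg(agg_functions))
# ...while the remaining groups pass through untouched.
# `~` cannot negate a Python function, hence the explicit lambda.
df2 = df.groupby('type').filter(lambda x: not agg_condition(x))

result = pd.concat([df1, df2], ignore_index=True)
```

Note that the aggregated rows have no subtype column to fill, so subtype is NaN for them in `result`.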