Variations of this question have been asked (see this question), but I haven't found a good solution for what would seem to be a common use case of groupby in Pandas.
Say I have the dataframe lasts and I group by user:
import pandas as pd
lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})
groupby_obj = lasts.groupby('user')
And I have these functions I want to apply to groupby_obj (what the functions do isn't important and I made them up; just know that they require multiple columns from the dataframe):
def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()
def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)
I could apply each of these functions separately to the dataframe and then merge the resulting dataframes, but that seems inefficient and inelegant, and I imagine there has to be a one-line solution.
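For reference, the apply-separately-then-merge approach described above might look roughly like the sketch below. The names cols, grouped, col1, col2 and baseline are made up for illustration, and selecting the value columns before calling apply is an assumption made so that the grouping column is never passed to the functions; pd.concat stands in for the merge step because both results share the same user index.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()

def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)

# Hypothetical helper names; only the value columns are handed to apply.
cols = ['elapsed_time', 'running_time', 'num_cores']
grouped = lasts.groupby('user')[cols]

# One apply per function, each producing a Series indexed by user.
col1 = grouped.apply(custom_func).rename('custom_column_1')
col2 = grouped.apply(custom_func2).rename('custom_column_2')

# Line the two Series up on their shared index and restore user as a column.
baseline = pd.concat([col1, col2], axis=1).reset_index()
print(baseline)
```

This works, but it walks the groups once per function, which is the overhead the question is trying to avoid.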
I haven't really found one, although this blog post (search for "Create a function to get the stats of a group" towards the bottom of the page) suggested wrapping the functions into one function as a dictionary thusly:
def get_stats(group):
    return {'custom_column_1': custom_func(group), 'custom_column_2': custom_func2(group)}
However, when I run the code groupby_obj.apply(get_stats), instead of columns I get a column of dictionary results:
user
a {'custom_column_1': 29993.0, 'custom_column_2'...
d {'custom_column_1': 22493.5, 'custom_column_2'...
s {'custom_column_1': 19992.0, 'custom_column_2'...
dtype: object
When in reality I would like to use a line of code to get something closer to this dataframe:
user  custom_column_1  custom_column_2
a             29993.0            10000
d             22493.5            75000
s             19992.0            30000
Suggestions on improving this workflow?
If you slightly modify the get_stats function so that it returns a pd.Series instead of a dict:
def get_stats(group):
    return pd.Series({'custom_column_1': custom_func(group),
                      'custom_column_2': custom_func2(group)})
now you can simply do this:
In [202]: lasts.groupby('user').apply(get_stats).reset_index()
Out[202]:
  user  custom_column_1  custom_column_2
0    a          29993.0          10000.0
1    d          22493.5          75000.0
2    s          19992.0          30000.0
Alternative (bit ugly) approach which uses your functions unchanged, i.e. your original get_stats that returns a dict:
In [188]: pd.DataFrame(lasts.groupby('user')
                            .apply(get_stats).to_dict()) \
            .T \
            .rename_axis('user') \
            .reset_index()
Out[188]:
  user  custom_column_1  custom_column_2
0    a          29993.0          10000.0
1    d          22493.5          75000.0
2    s          19992.0          30000.0
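A variation on the same idea, not part of the approaches above: if you already have the Series of dicts, you can probably expand it by passing the list of dicts to the DataFrame constructor together with the original index. This is a minimal sketch; the names get_stats_dict, dict_results and expanded are hypothetical, and the value columns are selected before apply as an assumption so that the grouping column is not passed to the functions.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()

def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)

def get_stats_dict(group):
    # Same as the dict-returning get_stats from the question.
    return {'custom_column_1': custom_func(group),
            'custom_column_2': custom_func2(group)}

cols = ['elapsed_time', 'running_time', 'num_cores']

# A Series with one dict per user, as shown in the question.
dict_results = lasts.groupby('user')[cols].apply(get_stats_dict)

# Each dict becomes a row; the dict keys become the column names.
expanded = pd.DataFrame(dict_results.tolist(),
                        index=dict_results.index).reset_index()
print(expanded)
```

Compared with the to_dict() and .T chain, this skips the transpose, so each column keeps its own dtype instead of being upcast to a shared one.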