I want to use groupby.agg
where my group is the entire dataframe. Put another way, I want to use the agg
functionality without the groupby. I've looked for an example of this, but cannot find one.
Here's what I've done:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))
df
def describe(df):
    funcs = dict(Kurt=lambda x: x.kurt(),
                 Skew='skew',
                 Mean='mean',
                 Std='std')
    # a constant key for every row puts the whole frame in one group
    one_group = [True for _ in df.index]
    # apply every statistic to every column
    funcs_for_all = {k: funcs for k in df.columns}
    return df.groupby(one_group).agg(funcs_for_all).iloc[0].unstack().T
describe(df)
How was I supposed to have done this?
Instead of using groupby and aggregation together, you can also perform the aggregation without any groupby, applying it to each column separately.
If you are asking about Spark rather than pandas: at best you can use .first and .last to get the respective values from a groupBy, but not everything is available the way it is in pandas. Since there is a basic difference between how pandas and Spark handle data, not all functionality can be used in the same way.
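To illustrate that difference, here is a minimal PySpark sketch (my own illustration, assuming a local Spark session and the pandas frame df from the question). Spark's agg without a groupBy does treat the whole DataFrame as one group, but each column/statistic pair has to be spelled out explicitly rather than fanned out over all columns as pandas does:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # df is the pandas frame from the question

# Aggregate the whole frame as one group; every column/statistic pair
# must be named explicitly.
sdf.agg(F.mean('A'), F.stddev('A'), F.skewness('A'), F.kurtosis('A')).show()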
agg() is used to pass a function or list of functions to be applied to a Series, or even to each element of the Series separately. Given a list of functions, agg() returns multiple results. Parameters: func: function, list of functions, or string name of a function to be called on the Series.
agg is an alias for aggregate. Use the alias. A passed user-defined function will receive a Series for evaluation, and the aggregation is performed for each column.
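As a concrete sketch of that (assuming pandas 0.20 or later, where DataFrame.agg was introduced), the statistics from the question can be computed with no groupby at all; the rename to capitalised labels is purely cosmetic:

import pandas as pd
import numpy as np

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(6, 4), columns=list('ABCD'))

# DataFrame.agg applies each function to every column, treating the
# whole frame as a single implicit group.
df.agg(['kurt', 'skew', 'mean', 'std']).rename(index=str.capitalize)

This produces the same stats-by-columns table that the groupby detour builds.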
A small compaction of your own proposal, which I think improves readability; it exploits the fact that DataFrame.groupby() accepts a lambda function:
def describe(df):
    funcs = dict(Kurt=lambda x: x.kurt(),
                 Skew='skew',
                 Mean='mean',
                 Std='std')
    funcs_for_all = {k: funcs for k in df.columns}
    # a lambda returning a constant maps every row to the same group
    return df.groupby(lambda _: True).agg(funcs_for_all).iloc[0].unstack().T
describe(df)
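One caveat worth noting: passing a dict of dicts to .agg (renaming while aggregating) was deprecated in pandas 0.20 and removed in pandas 1.0, so on recent versions both versions of describe above raise a SpecificationError. A minimal sketch that avoids the nested renamer, using a plain list of functions and renaming afterwards (the kurt helper and the label mapping are my own choices):

def describe(df):
    def kurt(x):
        # older pandas has no 'kurt' string aggregation on GroupBy,
        # so wrap it in a named function
        return x.kurt()
    out = df.groupby(lambda _: True).agg([kurt, 'skew', 'mean', 'std'])
    return (out.iloc[0].unstack().T
               .rename(index={'kurt': 'Kurt', 'skew': 'Skew',
                              'mean': 'Mean', 'std': 'Std'}))

describe(df)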