We are working to get off of SAS and onto Python/Pandas. One thing we are having trouble with, however, is creating a replacement for PROC SUMMARY (a.k.a. PROC MEANS) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY is simply a routine that produces a table of "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:
- one row per group of the grouping variables (or a single row if no grouping variables are given);
- one column per requested statistic, with a caller-supplied output column name;
- the statistics limited to count, sum, mean, and weighted mean of a chosen input column.
We are not trying to do anything else (nothing graphical, etc.).
Here is what we have so far:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

def wmean_ungrouped(d, w):
    # weighted mean of Series d, using Series w as weights
    return d.dot(w) / w.sum()

def wmean_grouped(group, var_name_in, var_name_weight):
    # weighted mean of one column of a group, weighted by another column
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = {
    "mean":  np.mean,
    "sum":   np.sum,
    "count": np.count_nonzero
}
def my_summary(
    data,
    var_names_in,
    var_names_out,
    var_functions,
    var_name_weight=None,
    var_names_group=None
):
    result = DataFrame()
    if var_names_group is not None:
        # Grouped case: reduce each requested column within each group.
        grouped = data.groupby(var_names_group)
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                func = lambda x: wmean_grouped(x, var_name_in, var_name_weight)
                result[var_name_out] = Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    else:
        # Ungrouped case: reduce each requested column over the whole DataFrame.
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                result[var_name_out] = \
                    Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = Series(func(data[var_name_in]))
    return result
Here is a sample call to the my_summary() function:
my_summary(
    data=df,
    var_names_in=["x_1", "x_1", "x_1", "x_1"],
    var_names_out=["x_1_c", "x_1_s", "x_1_m", "x_1_wm"],
    var_functions=["count", "sum", "mean", "wmean"],
    var_name_weight="val_1",
    var_names_group=["Region", "Category"]
)
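For reference, this call assumes a DataFrame df with columns Region, Category, x_1 and val_1 (the weight column). A minimal sketch of such a frame, with made-up values purely for illustration:

import pandas as pd

df = pd.DataFrame({
    "Region":   ["East", "East", "East", "West", "West"],
    "Category": ["A",    "A",    "B",    "A",    "B"],
    "x_1":      [1.0,    2.0,    3.0,    4.0,    5.0],
    "val_1":    [1.0,    3.0,    2.0,    1.0,    2.0],  # weights used by "wmean"
})

The call then returns one row per (Region, Category) combination, with columns x_1_c, x_1_s, x_1_m and x_1_wm.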
my_summary() works, but as you can see, its implementation is not the prettiest. Here are the main issues:
1. DataFrame and DataFrameGroupBy have different ways of applying a programmatically selected reducing function to a single column. For a DataFrame, the only way I've found is to invoke the function directly: func(data[var_name_in]). Writing data[var_name_in].apply(func) doesn't work, because apply() on a Series does not reduce (unlike apply() on a DataFrame). For a DataFrameGroupBy, on the other hand, I have to use exactly that approach: grouped[var_name_in].apply(func), because something like func(grouped[var_name_in]) will not work (and there is no reason it should). See the sketch after this list.

2. The weighted average needs two separate implementations: the ungrouped function takes two Series-type parameters and needs dot() to multiply and reduce them, while the grouped function eventually deals with SeriesGroupBy objects and has to use the * operator (acknowledgements to the answer to this SO post for the weighted-average function code).
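To make the first issue concrete, here is a minimal sketch of the asymmetry, using np.sum as the reducing function and the hypothetical df from the example above:

import numpy as np

func = np.sum

# Ungrouped: invoke the reducing function on the column directly.
total = func(df["x_1"])
# df["x_1"].apply(func) would not reduce -- Series.apply maps element-wise.

# Grouped: select the column on the groupby object and go through apply().
grouped = df.groupby(["Region", "Category"])
totals = grouped["x_1"].apply(func)
# func(grouped["x_1"]) is not a supported way to reduce a SeriesGroupBy.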
So my question is: is there a way to obtain a DataFrameGroupBy object from a DataFrame without grouping on any variable? The two code paths would then collapse into one, since we would be dealing with the DataFrameGroupBy interface exclusively.

UPDATE: @JohnE's answer provides a way to group by nothing: groupby(lambda x: True). This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a DataFrame.agg(), which would serve the same purpose). @JohnE's excellent solution lets us deal exclusively with objects of type DataFrameGroupBy, and immediately removes most of the duplicated code paths. I was able to reduce the code further with some functional gimmickry that is now possible because we only ever have DataFrameGroupBy instances. Basically, all functions are generated as needed; the "generators" (in quotes so as not to be confused with Python generator expressions) take two parameters, the value column name and the weight column name, the second of which is ignored in all cases except wmean. The generated functions are always applied over the entire DataFrameGroupBy, as was originally the case only for wmean, with the parameters supplying the correct column name(s) to use. I also replaced the np.* implementations with pandas calculations, to better handle NaN values.
Unless there is something native to pandas that can do this, this is our solution:
FUNC_GENS = {
    "mean":  lambda y, z: lambda x: x[y].mean(),
    "sum":   lambda y, z: lambda x: x[y].sum(),
    "count": lambda y, z: lambda x: x[y].count(),
    "wmean": lambda y, z: lambda x: (x[y] * x[z]).sum() / x[z].sum()
}

def my_summary(
    data,
    var_names_in,
    var_names_out,
    var_functions,
    var_name_weight=None,
    var_names_group=None
):
    result = pd.DataFrame()
    if var_names_group is None:
        # "group by nothing": a single dummy group covering all rows
        grouped = data.groupby(lambda x: True)
    else:
        grouped = data.groupby(var_names_group)
    for var_name_in, var_name_out, var_function in \
            zip(var_names_in, var_names_out, var_functions):
        func_gen = FUNC_GENS[var_function]
        func = func_gen(var_name_in, var_name_weight)
        result[var_name_out] = grouped.apply(func)
    return result
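With this version there is a single code path, so an ungrouped call (var_names_group left as None) goes through groupby(lambda x: True) and produces a one-row result. For example, using the same hypothetical df and arguments as in the sample call above:

summary_all = my_summary(
    data=df,
    var_names_in=["x_1", "x_1", "x_1", "x_1"],
    var_names_out=["x_1_c", "x_1_s", "x_1_m", "x_1_wm"],
    var_functions=["count", "sum", "mean", "wmean"],
    var_name_weight="val_1"
)
# summary_all has a single row (the one dummy group), with the four
# statistics computed over the entire DataFrame.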
Versions of pandas released subsequent to my original post now implement most of this functionality:
- DataFrame.agg() (the method Wes mentioned in the post linked above) was indeed added in version 0.20.0, along with a Series.agg();
- named output columns can now be requested by passing NamedAgg inputs to the agg() function.
So, basically everything except the weighted average. A good current solution for that is here.
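As a sketch of what that looks like today (pandas 0.25 or later; the weighted mean would still need a custom function such as the ones above), the count/sum/mean part of the sample call can be written with named aggregation:

summary = df.groupby(["Region", "Category"]).agg(
    x_1_c=pd.NamedAgg(column="x_1", aggfunc="count"),
    x_1_s=pd.NamedAgg(column="x_1", aggfunc="sum"),
    x_1_m=pd.NamedAgg(column="x_1", aggfunc="mean"),
)

The same call also accepts ("x_1", "count")-style tuples in place of pd.NamedAgg.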
Well, here's a quickie that does get at two of the issues (but still requires a separate function for the weighted mean). Mostly it uses the trick here (credit to @DSM) of doing groupby(lambda x: True) to get around the empty group-by. It would be great if there were a 'weights' kwarg on things like mean(), but to my knowledge there is not. Apparently there is a package for weighted quantiles, mentioned here and based on numpy, but I don't know anything about it. Great project, btw!

(Note that the names are mostly the same as yours; I just added a '2' to wmean_grouped and my_summary, so you can otherwise use the same calling interface.)
def wmean_grouped2(group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = {
    "mean":  np.mean,
    "sum":   np.sum,
    "count": np.count_nonzero
}

def my_summary2(
    data,
    var_names_in,
    var_names_out,
    var_functions,
    var_name_weight=None,
    var_names_group=None
):
    result = pd.DataFrame()
    if var_names_group is None:
        grouped = data.groupby(lambda x: True)
    else:
        grouped = data.groupby(var_names_group)
    for var_name_in, var_name_out, var_function in \
            zip(var_names_in, var_names_out, var_functions):
        if var_function == "wmean":
            func = lambda x: wmean_grouped2(x, var_name_in, var_name_weight)
            result[var_name_out] = pd.Series(grouped.apply(func))
        else:
            func = FUNCS[var_function]
            result[var_name_out] = grouped[var_name_in].apply(func)
    return result