We are working to get off of SAS and onto Python/pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY (a.k.a. PROC MEANS) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality: outputting a table with one or more statistics (count, sum, mean, weighted mean) of chosen input columns, optionally grouped by one or more class variables, with user-specified output column names.
We are not trying to do anything else (anything graphical, etc.)
Here is what we have so far:
import numpy as np
from pandas import DataFrame, Series

def wmean_ungrouped (d, w):
    # d and w are Series; dot() multiplies element-wise and reduces in one step
    return d.dot(w) / w.sum()
def wmean_grouped (group, var_name_in, var_name_weight):
    # group is a sub-DataFrame; its columns are plain Series, so * works
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()
FUNCS = {
    "mean"  : np.mean ,
    "sum"   : np.sum ,
    "count" : np.count_nonzero
}
def my_summary (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None
):
    result = DataFrame()
    if var_names_group is not None:
        grouped = data.groupby (var_names_group)
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in,var_names_out,var_functions):
            if var_function == "wmean":
                func = lambda x : wmean_grouped (x, var_name_in, var_name_weight)
                result[var_name_out] = Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    else:
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in,var_names_out,var_functions):
            if var_function == "wmean":
                result[var_name_out] = \
                    Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = Series(func(data[var_name_in]))
    return result
Here is a sample call to the my_summary() function:
    my_summary (
        data=df,
        var_names_in=["x_1","x_1","x_1","x_1"] ,
        var_names_out=[
            "x_1_c","x_1_s","x_1_m","x_1_wm"
        ] ,
        var_functions=["count","sum","mean","wmean"] ,
        var_name_weight="val_1" ,
        var_names_group=["Region","Category"]
)
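For illustration, here is what that call computes, written out directly with groupby on a hypothetical toy DataFrame (the column names Region, Category, x_1, and val_1 come from the call above; the data values are made up):

```python
import pandas as pd

# Made-up data; column names match the sample call above.
df = pd.DataFrame({
    "Region":   ["N", "N", "S", "S"],
    "Category": ["a", "a", "b", "b"],
    "x_1":      [1.0, 3.0, 2.0, 4.0],
    "val_1":    [1.0, 3.0, 1.0, 1.0],
})

grouped = df.groupby(["Region", "Category"])
out = pd.DataFrame({
    "x_1_c":  grouped["x_1"].count(),
    "x_1_s":  grouped["x_1"].sum(),
    "x_1_m":  grouped["x_1"].mean(),
    # weighted mean: sum(x * w) / sum(w) within each group
    "x_1_wm": grouped.apply(
        lambda g: (g["x_1"] * g["val_1"]).sum() / g["val_1"].sum()
    ),
})
# group ("N", "a"): count 2, sum 4.0, mean 2.0, wmean (1*1 + 3*3) / 4 = 2.5
```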
my_summary() works, but as you can see, its implementation is not the prettiest. Here are the main issues:
1. DataFrame and DataFrameGroupBy have different ways of applying a programmatically selected reducing function to a single column. For a DataFrame, the only way I've found is to invoke func(data[var_name_in]) directly; data[var_name_in].apply(func) doesn't work, because apply() on a Series maps element-wise and does not reduce (unlike apply() on a DataFrame). For a DataFrameGroupBy, on the other hand, I have to use that very approach: grouped[var_name_in].apply(func), because something like func(grouped[var_name_in]) will not work (no reason it should).
2. I need two separate weighted-mean implementations: the ungrouped function takes Series-type parameters and needs dot() to multiply and reduce them, while the grouped function eventually deals with SeriesGroupBy objects and has to use the * operator (acknowledgements to the answer to this SO post for the weighted-average function code).
So my questions are:
1. Is there a uniform way of applying a programmatically selected reducing function to a single column of either a DataFrame or a DataFrameGroupBy?
2. Is it possible to obtain a DataFrameGroupBy object from a DataFrame without grouping on any variable? Then the code paths would be reduced, as we would be dealing with the DataFrameGroupBy interface exclusively.
UPDATE: @JohnE's answer provides a way to group by nothing: groupby(lambda x: True). This is a workaround he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a DataFrame.agg(), which would serve the same purpose). @JohnE's excellent solution allows us to deal exclusively with objects of type DataFrameGroupBy, and instantly reduces most of the code paths. I was able to reduce further using some functional gimmickry that is now possible because we have only DataFrameGroupBy instances. Basically, all functions are generated as needed: the "generators" (in quotes so as not to be confused with Python generator expressions) take two parameters, the value column name and the weight column name, the second of which is ignored in all cases except "wmean". The generated functions are always applied over the entire DataFrameGroupBy, as was originally the case just with wmean, with the correct column name(s) closed over. I also replaced all the np.* implementations with pandas methods, which deal better with NaN values.
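As an aside, the apply() asymmetry between Series and SeriesGroupBy described in issue 1 can be demonstrated in a few lines (toy data, hypothetical column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 2.0, 4.0]})

# On a plain Series, apply() maps element-wise and does NOT reduce:
mapped = df["x"].apply(np.sum)     # still 3 elements, values unchanged
reduced = np.sum(df["x"])          # 7.0 -- only a direct call reduces

# On a grouped column, apply() DOES reduce, yielding one value per group:
per_group = df.groupby("g")["x"].apply(np.sum)   # a -> 3.0, b -> 4.0
```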
Unless there is something native to pandas that can do this, this is our solution:
# Each entry is a function "generator": given the value column name y and
# the weight column name z, it returns a reducer over a whole group x.
FUNC_GENS = {
    "mean"  : lambda y, z : lambda x : x[y].mean() ,
    "sum"   : lambda y, z : lambda x : x[y].sum() ,
    "count" : lambda y, z : lambda x : x[y].count() ,
    "wmean" : lambda y, z : lambda x : (x[y] * x[z]).sum() / x[z].sum()
}
def my_summary (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):
    result = pd.DataFrame()
    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)
    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        func_gen = FUNC_GENS[var_function]
        func = func_gen (var_name_in, var_name_weight)
        result[var_name_out] = grouped.apply(func)
    return result
Versions of pandas released subsequent to my original post now implement most of this functionality natively: DataFrame.agg() was indeed added in version 0.20.0, along with a Series.agg(), and NamedAgg inputs to the agg() function (named aggregation, added in 0.25.0) take care of the custom output column names. So, basically everything except the weighted average. A good current solution for that is here.
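A sketch of what that looks like in modern pandas (>= 0.25): named aggregation covers count/sum/mean and the output column names, while the weighted mean still needs an apply (made-up data; column names reuse those from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["N", "N", "S", "S"],
    "x_1":    [1.0, 3.0, 2.0, 4.0],
    "val_1":  [1.0, 3.0, 1.0, 1.0],
})

# Named aggregation (pandas >= 0.25): output column names are the keywords.
summary = df.groupby("Region").agg(
    x_1_c=("x_1", "count"),
    x_1_s=("x_1", "sum"),
    x_1_m=("x_1", "mean"),
)

# The weighted mean has no built-in spelling; join it in via apply.
summary["x_1_wm"] = df.groupby("Region").apply(
    lambda g: (g["x_1"] * g["val_1"]).sum() / g["val_1"].sum()
)
```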
Well, here's a quickie that does get at two of the issues (but still requires a different function for the weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group-by by doing groupby(lambda x: True). It would be great if there were a kwarg for weights on things like mean, but to my knowledge there is not. Apparently there is a package for weighted quantiles, mentioned here and based on numpy, but I don't know anything about it. Great project, btw!
(Note that the names are mostly the same as yours; I just added a '2' to wmean_grouped and my_summary, so you can use the same calling interface.)
def wmean_grouped2 (group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()
FUNCS = { "mean"  : np.mean ,
          "sum"   : np.sum ,
          "count" : np.count_nonzero }
def my_summary2 (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):
    result = pd.DataFrame()
    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)
    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        if var_function == "wmean":
            func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
            result[var_name_out] = pd.Series(grouped.apply(func))
        else:
            func = FUNCS[var_function]
            result[var_name_out] = grouped[var_name_in].apply(func)
    return result
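The group-by-nothing trick can also be seen in isolation; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"x_1": [1.0, 2.0, 3.0], "val_1": [1.0, 1.0, 2.0]})

# The key function receives each index label; a constant key puts every
# row into one group, so the DataFrameGroupBy machinery covers the
# "no grouping variables" case too.
grouped = df.groupby(lambda x: True)

total = grouped["x_1"].sum()        # one-row Series indexed by [True]
wmean = grouped.apply(
    lambda g: (g["x_1"] * g["val_1"]).sum() / g["val_1"].sum()
)
# total is 6.0; wmean is (1*1 + 2*1 + 3*2) / 4 = 2.25
```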