Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a partial SAS PROC SUMMARY replacement in Python/Pandas

Tags:

python

pandas

We are working to get off of SAS and onto Python/Pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY (AKA PROC MEANS) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:

  • Ability to apply different stats to different columns (for now just count, sum, mean, weighted mean)
  • Ability to handle zero to many grouping variables
  • Ability to specify a weight variable for weighted mean

We are not trying to do anything else (anything graphical, etc.)

Here is what we have so far:

def wmean_ungrouped (d,w):
    return (d.dot(w)).sum() / w.sum()

def wmean_grouped (group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = {
    "mean"   : np.mean ,
    "sum"   : np.sum ,
    "count" : np.count_nonzero
}

def my_summary (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None
):
    result = DataFrame()

    if var_names_group is not None:
        grouped = data.groupby (var_names_group)
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in,var_names_out,var_functions):
            if var_function == "wmean":
                func = lambda x : wmean_grouped (x, var_name_in, var_name_weight)
                result[var_name_out] = Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    else:
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in,var_names_out,var_functions):
            if var_function == "wmean":
                result[var_name_out] = \
                    Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = Series(func(data[var_name_in]))

    return result

Here is a sample call to the my_summary() function:

    my_summary (
        data=df,
        var_names_in=["x_1","x_1","x_1","x_1"] ,
        var_names_out=[
            "x_1_c","x_1_s","x_1_m","x_1_wm"
        ] ,
        var_functions=["count","sum","mean","wmean"] ,
        var_name_weight="val_1" ,
        var_names_group=["Region","Category"]
)

my_summary() works, but as you can see, its implementation is not the prettiest. Here are the main issues:

  • Two different code paths depending on grouped or ungrouped - this stems completely from the fact that DataFrame and DataFrameGroupBy have different ways for applying a programmatically-selected reducing function to a single column. For DataFrame, the only way I've found is directly invoking func(data[var_name_in]). data[var_name_in].apply(func) doesn't work because apply() on a Series does not reduce (unlike apply() on a DataFrame). On the other hand, for DataFrameGroupBy, I have to use that very approach: grouped[var_name_in].apply(func). That's because something like func(grouped[var_name_in]) will not work (no reason it should.)
  • Special treatment for weighted mean - this is because it operates on two columns, unlike all the other calculations, which operate on just one; I don't know if this can be helped.
  • Two different weighted mean functions - this is a consequence of the first issue. The ungrouped function has Series-type parameters and needs dot() to multiply and reduce them; the grouped function eventually deals with SeriesGroupBy objects and has to use the * operator (acknowledgements to the answer to this SO post for the weighted average function code.)

So my questions are:

  • Is there something native to pandas that can do all of this (i.e. throw out the above and use that instead)?
  • If not, are there any fixes to any of the issues mentioned above?
  • By any chance, is there some way to group by nothing - that is, to obtain a DataFrameGroupBy object from a DataFrame without grouping on any variable? Then the code paths would be reduced as we would be dealing with the DataFrameGroupBy interface exclusively.

Update (old - scroll down for current)

@JohnE's answer provides a way to group by nothing: groupby(lambda x: True). This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a DataFrame.agg(), which would serve the same purpose). @JohnE's excellent solution allows us to deal exclusively with objects of type DataFrameGroupBy, and instantly reduces most of the code paths. I was able to reduce further using some functional gimmickry that is now possible because we have only DataFrameGroupBy instances. Basically, all functions are generated as needed - the "generators" (in quotes here so as to not be confused with Python generator expressions) take two parameters: value column name and weight column name, the second of which is ignored in all cases except wmean. The generated functions are always applied over the entire DataFrameGroupBy, as was originally the case just with wmean, with the parameters being the correct column name(s) to use. I also replaced all the np.* implementations with pandas calculations, to better deal with NaN values.

Unless there is something native to pandas that can do this, this is our solution:

FUNC_GENS = {
    "mean"  : lambda y,z : lambda x : x[y].mean(),
    "sum"   : lambda y,z : lambda x : x[y].sum() ,
    "count" : lambda y,z : lambda x : x[y].count() ,
    "wmean" : lambda y,z : lambda x : (x[y] * x[z]).sum() / x[z].sum()
}

def my_summary (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):

    result = pd.DataFrame()

    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)

    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        func_gen = FUNC_GENS[var_function]
        func = func_gen (var_name_in, var_name_weight)
        result[var_name_out] = grouped.apply(func)

    return result

Update 2019 / Current Solution

Versions of pandas released subsequent to my original post now implement most of this functionality:

  • Grouping by nothing - Wes M. had spoken in the past of the need for a DataFrame.agg() and it was indeed added in version 0.20.0 along with a Series.agg().
  • Multiple aggregations of multiple columns, with specified names for output columns - this is now part of pandas as of version 0.25.+ in the form of NamedAgg inputs to the agg() function

So, basically everything except the weighted average. A good current solution for that is here.

like image 989
sparc_spread Avatar asked Apr 28 '15 18:04

sparc_spread


1 Answers

Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lamda x: True). It would be great if there was a kwarg for 'weights' on stuff like means but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here based on numpy but I don't know anything about it. Great project btw!

(note that names are mostly the same as yours, I just added a '2' to wmean_grouped and my_summary, otherwise you can use the same calling interface)

def wmean_grouped2 (group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = { "mean"  : np.mean ,
          "sum"   : np.sum ,
          "count" : np.count_nonzero }

def my_summary2 (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):

    result = pd.DataFrame()

    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)

    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        if var_function == "wmean":
            func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
            result[var_name_out] = pd.Series(grouped.apply(func))
        else:
            func = FUNCS[var_function]
            result[var_name_out] = grouped[var_name_in].apply(func)

    return result
like image 158
JohnE Avatar answered Sep 18 '22 09:09

JohnE