Perform aggregation(s) on MultiIndex columns

I'm starting with this dataframe:

import pandas as pd

df = pd.DataFrame(
    [
        ["a", "aa", "2020-12-20", 10],
        ["a", "ab", "2020-12-26", 11],
        ["a", "aa", "2020-12-22", 10],
        ["b", "bb", "2020-12-25", 111],
        ["c", "bb", "2020-12-20", 20],
        ["d", "dd", "2020-12-05", 1111]
    ],
    columns=["cat", "user", "date", "value"]
)
df["date"] = pd.to_datetime(df.date)

  cat user                 date  value
0   a   aa  2020-12-20 00:00:00     10
1   a   ab  2020-12-26 00:00:00     11
2   a   aa  2020-12-22 00:00:00     10
3   b   bb  2020-12-25 00:00:00    111
4   c   bb  2020-12-20 00:00:00     20
5   d   dd  2020-12-05 00:00:00   1111

Next, I'm running the following aggregation:

gb = (
    df.set_index("date")
    .groupby("cat")
    .resample("W")
    .agg(
        {"value": "sum", "user": ["nunique", lambda x: x.unique()]}
    )
    .rename({"<lambda>": "unqiue_users"}, axis=1)
)

This yields a table with a MultiIndex in the columns:

               value    user             
                 sum nunique unqiue_users
cat date                                 
a   2020-12-20    10       1           aa
    2020-12-27    21       2     [aa, ab]
b   2020-12-27   111       1           bb
c   2020-12-20    20       1           bb
d   2020-12-06  1111       1           dd

Lastly, I'm trying to run aggregations on that result, like this:

gb.groupby(level=0)[["value", "sum"]].mean()

I don't know how to select columns that are part of a MultiIndex. Any ideas?

asked Jan 12 '21 07:01 by Dror



1 Answer

To select a MultiIndex column, use a tuple; here the tuple is wrapped in a one-element list so the result stays a DataFrame:

print (gb.groupby(level=0)[[("value", "sum")]].mean())
      value
        sum
cat        
a      15.5
b     111.0
c      20.0
d    1111.0
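
As a rough sketch of why the tuple matters: a flat list such as ["value", "sum"] is read as two separate top-level keys, and "sum" is not a top-level column, so plain column selection fails with a KeyError. The frame below is only a small stand-in for gb, rebuilt from the numbers in the question (the dates are kept as strings and the unqiue_users column is left out for brevity).

```python
import pandas as pd

# Stand-in for the question's `gb`: two row-index levels and two-level columns.
# Values are copied from the table in the question; dates are plain strings here.
idx = pd.MultiIndex.from_tuples(
    [
        ("a", "2020-12-20"),
        ("a", "2020-12-27"),
        ("b", "2020-12-27"),
        ("c", "2020-12-20"),
        ("d", "2020-12-06"),
    ],
    names=["cat", "date"],
)
cols = pd.MultiIndex.from_tuples([("value", "sum"), ("user", "nunique")])
gb = pd.DataFrame(
    [[10, 1], [21, 2], [111, 1], [20, 1], [1111, 1]],
    index=idx,
    columns=cols,
)

# A flat list is treated as several keys, so "sum" is looked up as a
# top-level column name, which does not exist.
try:
    gb[["value", "sum"]]
    flat_list_failed = False
except KeyError:
    flat_list_failed = True

# A tuple names exactly one column: one label per column level.
series = gb[("value", "sum")]    # Series
frame = gb[[("value", "sum")]]   # one-column DataFrame

means = frame.groupby(level=0).mean()
print(means)
```

The only difference between the last two selections is the return type: the bare tuple gives a Series, the list of tuples gives a DataFrame that keeps the two-level column header.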

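Or you can simplify it by taking the mean per index level (the level argument of mean was removed in pandas 2.0; on newer versions use .groupby(level=0).mean() instead):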

print (gb[[("value", "sum")]].mean(level=0))
      value
        sum
cat        
a      15.5
b     111.0
c      20.0
d    1111.0

To get a Series instead, omit the nested list:

print (gb[("value", "sum")].mean(level=0))
cat
a      15.5
b     111.0
c      20.0
d    1111.0
Name: (value, sum), dtype: float64
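
For readers on pandas 2.0 or later, where mean no longer accepts level, the same per-category mean can be written with groupby on the index level. This is a minimal sketch; weekly_sum is a hypothetical stand-in for gb[("value", "sum")], filled with the numbers from the question.

```python
import pandas as pd

# Hypothetical stand-in for gb[("value", "sum")]: weekly sums per category.
idx = pd.MultiIndex.from_tuples(
    [
        ("a", "2020-12-20"),
        ("a", "2020-12-27"),
        ("b", "2020-12-27"),
        ("c", "2020-12-20"),
        ("d", "2020-12-06"),
    ],
    names=["cat", "date"],
)
weekly_sum = pd.Series([10, 21, 111, 20, 1111], index=idx, name=("value", "sum"))

# Group on the first index level by position ...
per_cat = weekly_sum.groupby(level=0).mean()

# ... or by its name, which reads more clearly when the levels are named.
per_cat_by_name = weekly_sum.groupby(level="cat").mean()

print(per_cat)
```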

Your solution can also be changed to avoid a MultiIndex in the columns by using named aggregation:

gb = (
    df.set_index("date")
    .groupby(["cat", pd.Grouper(freq='W')])
    .agg(val = ("value",  "sum"),
         nuniq = ("user", "nunique"),
         unqiue_users = ("user", lambda x: x.unique()))
    )
    
print (gb)
                 val  nuniq unqiue_users
cat date                                
a   2020-12-20    10      1           aa
    2020-12-27    21      2     [ab, aa]
b   2020-12-27   111      1           bb
c   2020-12-20    20      1           bb
d   2020-12-06  1111      1           dd


# on pandas >= 2.0 use gb['val'].groupby(level=0).mean() instead
print (gb['val'].mean(level=0))
cat
a      15.5
b     111.0
c      20.0
d    1111.0
Name: val, dtype: float64
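
If the two-level result already exists and rerunning the aggregation is not convenient, another option is to flatten the column labels afterwards. This is only a sketch: the "_" separator and the resulting names such as value_sum are arbitrary choices, and the frame is a small stand-in for the question's gb.

```python
import pandas as pd

# Stand-in for the question's `gb` with two-level columns.
idx = pd.MultiIndex.from_tuples(
    [
        ("a", "2020-12-20"),
        ("a", "2020-12-27"),
        ("b", "2020-12-27"),
        ("c", "2020-12-20"),
        ("d", "2020-12-06"),
    ],
    names=["cat", "date"],
)
cols = pd.MultiIndex.from_tuples([("value", "sum"), ("user", "nunique")])
gb = pd.DataFrame(
    [[10, 1], [21, 2], [111, 1], [20, 1], [1111, 1]],
    index=idx,
    columns=cols,
)

# Join each (top, bottom) label pair into one flat string label.
flat = gb.copy()
flat.columns = ["_".join(col) for col in flat.columns]

# Flat labels can now be selected with ordinary strings.
flat_means = flat.groupby(level=0)["value_sum"].mean()
print(flat_means)
```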
answered Sep 23 '22 19:09 by jezrael