I'm starting with this dataframe:
df = pd.DataFrame(
[
["a", "aa", "2020-12-20", 10],
["a", "ab", "2020-12-26", 11],
["a", "aa", "2020-12-22", 10],
["b", "bb", "2020-12-25", 111],
["c", "bb", "2020-12-20", 20],
["d", "dd", "2020-12-05", 1111]
],
columns=["cat", "user", "date", "value"]
)
df["date"] = pd.to_datetime(df.date)
cat | user | date | value | |
---|---|---|---|---|
0 | a | aa | 2020-12-20 00:00:00 | 10 |
1 | a | ab | 2020-12-26 00:00:00 | 11 |
2 | a | aa | 2020-12-22 00:00:00 | 10 |
3 | b | bb | 2020-12-25 00:00:00 | 111 |
4 | c | bb | 2020-12-20 00:00:00 | 20 |
5 | d | dd | 2020-12-05 00:00:00 | 1111 |
Next, I'm running the following aggregation:
gb = (
df.set_index("date")
.groupby("cat")
.resample("W")
.agg(
{"value": "sum", "user": ["nunique", lambda x: x.unique()]}
)
.rename({"<lambda>": "unqiue_users"}, axis=1)
)
This yields a table with multiindex in the columns:
value user
sum nunique unqiue_users
cat date
a 2020-12-20 10 1 aa
2020-12-27 21 2 [aa, ab]
b 2020-12-27 111 1 bb
c 2020-12-20 20 1 bb
d 2020-12-06 1111 1 dd
Lastly, I'm trying to run aggregations on the last result like:
gb.groupby(level=0)[["value", "sum"]].mean()
I don't know how to "access" the columns that have multiindex. Any idea?
Often you may want to group and aggregate by multiple columns of a pandas DataFrame. Fortunately this is easy to do using the pandas .groupby () and .agg () functions. This tutorial explains several examples of how to use these functions in practice.
Note: When we do multiple aggregations on a single column (when there is a list of aggregation operations), the resultant data frame column names will have multiple levels. To access them easily, we must flatten the levels – which we will see at the end of this note. For now, let’s proceed to the next level of aggregation.
To create a MultiIndex with our original DataFrame, all we need to do is pass a list of columns into the .set_index () Pandas function like this: Here, we can already see that the new DataFrame called “multi” has been organized so that there are now four columns that make up the index.
A MultiIndex, also known as a multi-level index or hierarchical index, allows you to have multiple columns acting as a row identifier, while having each index column related to another through a parent/child relationship.
For select MultiIndex and used tuples, here is used one element list:
print (gb.groupby(level=0)[[("value", "sum")]].mean())
value
sum
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Or you can use simplify solution with mean
per level:
print (gb[[("value", "sum")]].mean(level=0))
value
sum
cat
a 15.5
b 111.0
c 20.0
d 1111.0
For Series
select omit nested list:
print (gb[("value", "sum")].mean(level=0))
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Name: (value, sum), dtype: float64
Your solution should be changed for avoid MultiIndex
in columns:
gb = (
df.set_index("date")
.groupby(["cat", pd.Grouper(freq='W')])
.agg(val = ("value", "sum"),
nuniq = ("user", "nunique"),
unqiue_users = ("user", lambda x: x.unique()))
)
print (gb)
val nuniq unqiue_users
cat date
a 2020-12-20 10 1 aa
2020-12-27 21 2 [ab, aa]
b 2020-12-27 111 1 bb
c 2020-12-20 20 1 bb
d 2020-12-06 1111 1 dd
print (gb['val'].mean(level=0))
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Name: val, dtype: float64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With