Assume we have the following DataFrame:
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n),
    }
)
If I now group by cat and datetime:
gb = df.groupby(['cat','datetime']).sum()
I get the totals for each cat for each hour:
                         val
cat datetime
a   2011-01-01 00:00:00    1
    2011-01-01 09:00:00    3
    2011-01-02 16:00:00    1
    2011-01-03 16:00:00    1
b   2011-01-01 08:00:00    4
    2011-01-01 15:00:00    3
    2011-01-01 16:00:00    3
    2011-01-02 04:00:00    4
    2011-01-02 05:00:00    1
    2011-01-02 12:00:00    4
However, I would like to have something like:
                val
cat datetime
a   2011-01-01    4
    2011-01-02    1
    2011-01-03    1
b   2011-01-01   10
    2011-01-02    9
I could get the desired result by adding another column called date:
df['date'] = df['datetime'].apply(lambda ts: ts.date())
and then do a similar groupby:
df.groupby(['cat', 'date']).sum()
But I am interested in whether there is a more Pythonic way to do it. In addition, I might want to look at the month or year level. So, what would be the right way?
Note that pandas also provides pd.Grouper. A Grouper allows the user to specify a groupby instruction for an object: it will select a column via the key parameter, or, if the level and/or axis parameters are given, a level of the index of the target object.
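As a minimal sketch of that approach (assuming the df defined above; the variable name daily is just illustrative):
# group by cat and by calendar day of the datetime column using pd.Grouper
daily = df.groupby(['cat', pd.Grouper(key='datetime', freq='D')])['val'].sum()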
From your intermediate structure, you can use .unstack to separate the categories, do a .resample, and then .stack again to get back to the original form:
In [126]: gb = df.groupby(['cat', 'datetime']).sum()
In [127]: gb.unstack(0)
Out[127]:
                     val
cat                    a    b
datetime
2011-01-01 00:00:00  1.0  NaN
2011-01-01 08:00:00  NaN  4.0
2011-01-01 09:00:00  3.0  NaN
2011-01-01 15:00:00  NaN  3.0
2011-01-01 16:00:00  NaN  3.0
2011-01-02 04:00:00  NaN  4.0
2011-01-02 05:00:00  NaN  1.0
2011-01-02 12:00:00  NaN  4.0
2011-01-02 16:00:00  1.0  NaN
2011-01-03 16:00:00  1.0  NaN
In [128]: gb.unstack(0).resample("D").sum().stack()
Out[128]:
                  val
datetime   cat
2011-01-01 a      4.0
           b     10.0
2011-01-02 a      1.0
           b      9.0
2011-01-03 a      1.0
EDIT: For other resampling frequencies (month, year, etc.) there is a good list of the options in the pandas resample documentation.
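For example, a monthly roll-up would follow the same pattern; this is only a sketch, assuming month-start bins ("MS") are what you want:
# same unstack/resample/stack pattern, with month-start bins instead of daily bins
monthly = gb.unstack(0).resample("MS").sum().stack()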