Assume we have the following DataFrame:
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n),
    }
)
If I now group by cat and datetime:
gb = df.groupby(['cat','datetime']).sum()
I get the totals for each cat for each hour:
                         val
cat datetime
a   2011-01-01 00:00:00    1
    2011-01-01 09:00:00    3
    2011-01-02 16:00:00    1
    2011-01-03 16:00:00    1
b   2011-01-01 08:00:00    4
    2011-01-01 15:00:00    3
    2011-01-01 16:00:00    3
    2011-01-02 04:00:00    4
    2011-01-02 05:00:00    1
    2011-01-02 12:00:00    4
However, I would like to have something like:
                val
cat datetime
a   2011-01-01    4
    2011-01-02    1
    2011-01-03    1
b   2011-01-01   10
    2011-01-02    9
I could get the desired result by adding another column called date:
df['date'] = df['datetime'].apply(lambda ts: ts.date())
and then do a similar groupby:
df.groupby(['cat', 'date']).sum()
But I am interested in whether there is a more Pythonic way to do it. In addition, I might want to look at the month or year level. So, what would be the right way?
Note that pandas also provides pd.Grouper. A Grouper allows the user to specify a groupby instruction for an object: it will select a column via the key parameter, or, if the level and/or axis parameters are given, a level of the index of the target object.
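As a minimal sketch of that approach (assuming the df defined above; the variable name daily is just illustrative):
# group by cat and by calendar day of the datetime column using pd.Grouper
daily = df.groupby(['cat', pd.Grouper(key='datetime', freq='D')])['val'].sum()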
From your intermediate structure, you can use .unstack to separate the categories, do a .resample, and then .stack again to get back to the original form:
In [126]: gb = df.groupby(['cat', 'datetime']).sum()
In [127]: gb.unstack(0)
Out[127]:
                     val
cat                    a    b
datetime
2011-01-01 00:00:00  1.0  NaN
2011-01-01 08:00:00  NaN  4.0
2011-01-01 09:00:00  3.0  NaN
2011-01-01 15:00:00  NaN  3.0
2011-01-01 16:00:00  NaN  3.0
2011-01-02 04:00:00  NaN  4.0
2011-01-02 05:00:00  NaN  1.0
2011-01-02 12:00:00  NaN  4.0
2011-01-02 16:00:00  1.0  NaN
2011-01-03 16:00:00  1.0  NaN
In [128]: gb.unstack(0).resample("D").sum().stack()
Out[128]:
                  val
datetime   cat
2011-01-01 a      4.0
           b     10.0
2011-01-02 a      1.0
           b      9.0
2011-01-03 a      1.0
EDIT: For other resampling frequencies (month, year, etc.) there is a good list of the options in the pandas resample documentation.
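For example, a monthly roll-up would follow the same pattern; this is only a sketch, assuming month-start bins ("MS") are what you want:
# same unstack/resample/stack pattern, with month-start bins instead of daily bins
monthly = gb.unstack(0).resample("MS").sum().stack()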