
Grouping Pandas DataFrame by date

I have a Pandas DataFrame that includes a date column. Elements of that column are of type pandas.tslib.Timestamp.

I'd like to group the dataframe by date, ignoring any timestamp information more granular than the date (i.e. all rows from Feb 23, 2011 end up in one group). I know how to express this in SQL, but am quite new to Pandas.

This question does something very similar, but I don't understand the code and it uses datetime objects.

From the documentation, I don't even understand how to retrieve the date from a Pandas Timestamp object. I could convert to datetime object, but that seems very roundabout.


As requested, the output of df.head():

                  date              show  network  timed                            session_id
0  2011-12-03 02:48:52              Monk     TV38    670  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
1  2011-12-03 03:00:09          WBZ News     TV38    205  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
2  2011-12-03 03:04:04      Dateline NBC      NBC     30  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
3  2011-12-03 03:04:35             20/20      ABC     25  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
4  2011-12-03 03:04:56  College Football      FOX     55  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
David Chouinard asked Nov 21 '13 16:11



2 Answers

You can use the DatetimeIndex.normalize method, which sets every timestamp to midnight of the same day:

In [11]: df['date']
Out[11]: 
0   2011-12-03 02:48:52
1   2011-12-03 03:00:09
2   2011-12-03 03:04:04
3   2011-12-03 03:04:35
4   2011-12-03 03:04:56
Name: date, dtype: datetime64[ns]

In [12]: pd.DatetimeIndex(df['date']).normalize()
Out[12]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-03 00:00:00, ..., 2011-12-03 00:00:00]
Length: 5, Freq: None, Timezone: None

You can then group by the result:

g = df.groupby(pd.DatetimeIndex(df['date']).normalize())

From pandas 0.15 onward the dt accessor is available on datetime columns, so you can write this as:

g = df.groupby(df['date'].dt.normalize())
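As a rough sketch of how the grouped object might then be used, the following builds a small frame modeled on the question's df.head() and sums timed per day. The sample rows (including the second day) and the names day and totals are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Made-up sample rows modeled on the question's df.head(), spanning two days.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-04 03:04:04",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC"],
    "timed": [670, 205, 30],
})

# Normalize each timestamp to midnight, then group on the result.
day = df["date"].dt.normalize()
totals = df.groupby(day)["timed"].sum()
print(totals)
```

The group keys stay as Timestamp values at midnight, so the result keeps a datetime index that can be sliced or resampled later.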
Andy Hayden answered Oct 16 '22 12:10


It's not clear if you are trying to groupby and aggregate (like in SQL) or create an index with a date instead of a timestamp.

If you're trying to group by and aggregate, you could do it this way:

df.groupby(df.set_index('date').index.date).mean(numeric_only=True)

Time series indexes expose datetime properties such as day and date. The mean aggregates only the timed column, since it's the only numeric column.
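A minimal, self-contained sketch of the same idea, assuming a reasonably recent pandas where the dt accessor and the numeric_only argument are available. The sample rows and the name daily_mean are invented for illustration.

```python
import datetime
import pandas as pd

# Made-up sample rows modeled on the question's df.head(), spanning two days.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-04 03:04:04",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC"],
    "timed": [670, 205, 30],
})

# .dt.date yields plain datetime.date objects, which serve as the group keys.
# numeric_only=True skips the non-numeric columns when averaging.
daily_mean = df.groupby(df["date"].dt.date).mean(numeric_only=True)
print(daily_mean)
```

Unlike normalize, this produces an index of datetime.date objects rather than Timestamps, which reads nicely but loses the datetime-specific index features.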

If you're trying to create an index with a date level, you could do something like:

import datetime
df.set_index(['date', df.date.apply(lambda x: datetime.datetime.date(x))], inplace=True)
df.index.names = ['timestamp', 'daydate']

This gives you a MultiIndex with the timestamp and the date as levels. If you don't want to modify df itself, drop the inplace=True argument and work with the DataFrame that set_index returns.
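The non-inplace variant could look roughly like the sketch below, which also groups on the new date level. It swaps the apply/lambda for the dt accessor, and the sample rows and the names indexed and by_day are assumptions made for illustration.

```python
import datetime
import pandas as pd

# Made-up sample rows modeled on the question's df.head(), spanning two days.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-04 03:04:04",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC"],
    "timed": [670, 205, 30],
})

# Build the two-level index on a copy; df itself is left untouched.
indexed = df.set_index(["date", df["date"].dt.date.rename("daydate")])
indexed.index.names = ["timestamp", "daydate"]

# Aggregate on the date level of the MultiIndex.
by_day = indexed.groupby(level="daydate")["timed"].sum()
print(by_day)
```

Keeping the full timestamp as the first level means no information is lost, while the daydate level is available for grouping whenever it is needed.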

Patrick Russell answered Oct 16 '22 10:10