I have a Pandas DataFrame that includes a date column. Elements of that column are of type pandas.tslib.Timestamp.
I'd like to group the DataFrame by date, but exclude timestamp information that is more granular than the date (i.e. group by date, so that all rows from Feb 23, 2011 end up in the same group). I know how to express this in SQL, but am quite new to Pandas.
This question does something very similar, but I don't understand the code and it uses datetime objects.
From the documentation, I don't even understand how to retrieve the date from a Pandas Timestamp object. I could convert to a datetime object, but that seems very roundabout.
As requested, the output of df.head():
                 date              show network  timed                            session_id
0 2011-12-03 02:48:52              Monk    TV38    670  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
1 2011-12-03 03:00:09          WBZ News    TV38    205  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
2 2011-12-03 03:04:04      Dateline NBC     NBC     30  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
3 2011-12-03 03:04:35             20/20     ABC     25  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
4 2011-12-03 03:04:56  College Football     FOX     55  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
Pandas groupby groups data by category and applies a function to each group, which makes it an efficient way to aggregate data. The DataFrame.groupby() method splits the data into groups based on some criteria; pandas objects can be split on any of their axes.
The grouping key can be a mapper or a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
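As a rough illustration of the split, apply, combine steps described above, here is a small sketch that groups the question's sample rows by the network column and sums timed. Only three of the original columns are reproduced, and the variable names are chosen for this example.

```python
import pandas as pd

# Rows copied from the df.head() output in the question (subset of columns).
df = pd.DataFrame({
    "show": ["Monk", "WBZ News", "Dateline NBC", "20/20", "College Football"],
    "network": ["TV38", "TV38", "NBC", "ABC", "FOX"],
    "timed": [670, 205, 30, 25, 55],
})

# Split on the 'network' column, apply sum() to 'timed' within each group,
# and combine the per-group results into one Series indexed by network.
timed_per_network = df.groupby("network")["timed"].sum()
print(timed_per_network)
```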
You can use the normalize DatetimeIndex method (which takes it to midnight that day):
In [11]: df['date']
Out[11]:
0 2011-12-03 02:48:52
1 2011-12-03 03:00:09
2 2011-12-03 03:04:04
3 2011-12-03 03:04:35
4 2011-12-03 03:04:56
Name: date, dtype: datetime64[ns]
In [12]: pd.DatetimeIndex(df['date']).normalize()
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-03 00:00:00, ..., 2011-12-03 00:00:00]
Length: 5, Freq: None, Timezone: None
And you can groupby this:
g = df.groupby(pd.DatetimeIndex(df['date']).normalize())
From pandas 0.15 onwards, a datetime Series has the dt accessor, so you can write this as:
g = df.groupby(df['date'].dt.normalize())
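A minimal, self-contained sketch of this approach follows. The frame is modelled on the question's data, but the row dated 2011-12-04 is invented here so that there is more than one group, and the choice of sum() as the aggregation is only an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's df.head() output;
# the 2011-12-04 row is made up so the result has two groups.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-04 03:04:04",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC"],
    "timed": [670, 205, 30],
})

# normalize() keeps the datetime64 dtype but sets every time to midnight,
# so all rows from the same calendar day share one grouping key.
day = df["date"].dt.normalize()

totals = df.groupby(day)["timed"].sum()
print(totals)
```

Because the keys are still timestamps, the result keeps a DatetimeIndex, which is convenient if you later want to resample or slice by date strings.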
It's not clear if you are trying to groupby and aggregate (like in SQL) or create an index with a date instead of a timestamp.
If you're trying to group by and aggregate, you could do it this way:
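df.groupby(df.set_index('date').index.date).mean(numeric_only=True)
Time series indexes have datetime properties like day, date, etc. With numeric_only=True this aggregates only the timed column, since it's the only numeric column; pandas 2.0 and later no longer drop the non-numeric columns silently, so a bare mean() fails on the string columns.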
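The same idea can be sketched without the set_index step by going through the dt accessor and selecting the numeric column explicitly. This is a variant for illustration, not the answer's exact code; the sample rows are modelled on the question and the 2011-12-04 row is invented.

```python
import datetime

import pandas as pd

# Hypothetical sample data; the second day is made up for the example.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-04 03:04:04",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC"],
    "timed": [670, 205, 30],
})

# .dt.date yields plain datetime.date objects, so the group keys carry
# no time-of-day component at all.
by_date = df.groupby(df["date"].dt.date)["timed"].mean()
print(by_date)
```

Unlike normalize(), this produces an object-dtype index of datetime.date values rather than a DatetimeIndex, so look up entries with datetime.date keys.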
If you're trying to create an index with a date level, you could do something like:
import datetime
df.set_index(['date', df.date.apply(lambda x: datetime.datetime.date(x))], inplace=True)
df.index.names = ['timestamp', 'daydate']
This will give you a MultiIndex with the timestamp and a date. If you don't want the index to be permanent, drop the inplace= argument.
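A sketch of the non-permanent variant mentioned above: set_index returns a new frame and the original df is left untouched. It uses the dt accessor instead of apply, which is a substitution made here, and the sample rows (including the 2011-12-04 one) are hypothetical.

```python
import datetime

import pandas as pd

# Hypothetical sample data modelled on the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-04 03:04:04",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC"],
    "timed": [670, 205, 30],
})

# First level: the 'date' column itself (moved out of the columns).
# Second level: the calendar date derived from it.
indexed = df.set_index(["date", df["date"].dt.date])
indexed.index.names = ["timestamp", "daydate"]

# The date level can then be used directly as a grouping key.
per_day = indexed.groupby(level="daydate")["timed"].sum()
print(per_day)
```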