
Vectorized Operations on a datetime column in pandas

Tags: python, pandas

I want to take a column of datetime objects and return a column of integers giving the number of days from that datetime until today. I can do it in an ugly way, but I'm looking for a prettier (and faster) one.

So suppose I have a dataframe with a datetime column like so:

11    2014-03-04 17:16:26+00:00
12    2014-03-10 01:35:56+00:00
13    2014-03-15 02:35:51+00:00
14    2014-03-20 05:55:47+00:00
15    2014-03-26 04:56:33+00:00
Name: datetime, dtype: object

And each element looks like:

datetime.datetime(2014, 3, 4, 17, 16, 26, tzinfo=<UTC>)

Suppose I want to calculate how many days ago each observation occurred, and return that as a simple integer. I know I can just use apply twice, but is there a vectorized/cleaner way to do it?

import datetime

today = datetime.datetime.today().date()
df_dates = df['datetime'].apply(lambda x: x.date())
days_ago = today - df_dates

Which gives a timedelta64[ns] Series.

11   56 days, 00:00:00
12   50 days, 00:00:00
13   45 days, 00:00:00
14   40 days, 00:00:00
15   34 days, 00:00:00
Name: datetime, dtype: timedelta64[ns]

And then finally if I want it as an integer:

days_ago_as_int = days_ago.apply(lambda x: x.item().days)
days_ago_as_int
11    56
12    50
13    45
14    40
15    34
Name: datetime, dtype: int64

Any thoughts?


Related questions that didn't quite get at what I was asking:

Pandas Python- can datetime be used with vectorized inputs

Pandas add one day to column


Trying Karl D.'s answer, I can get today's date and the date column as desired, but something goes awry in the subtraction (the datetimes differ from the original example, but that shouldn't matter, right?):

converted_dates = df['date'].values.astype('datetime64[D]')
today_date = np.datetime64(dt.date.today())
print(converted_dates)
print(today_date)
print(today_date - converted_dates)

[2014-01-16 00:00:00 
2014-01-19 00:00:00 
2014-01-22 00:00:00
2014-01-26 00:00:00
2014-01-29 00:00:00]

2014-04-30 00:00:00

[16189 days, 0:08:20.637994
16189 days, 0:08:20.637991
16189 days, 0:08:20.637988
16189 days, 0:08:20.637984
16189 days, 0:08:20.637981]
exp1orer asked Oct 21 '22 09:10


1 Answer

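How about this (for a column named date)?

import datetime as dt
import numpy as np

# Truncate each timestamp to day resolution, then subtract from today's date
df['foo'] = (np.datetime64(dt.date.today())
             - df['date'].values.astype('datetime64[D]'))
print(df)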

                 date     foo
0 2014-03-04 17:16:26 56 days
1 2014-03-10 01:35:56 50 days
2 2014-03-15 02:35:51 45 days
3 2014-03-20 05:55:47 40 days
4 2014-03-26 04:56:33 34 days

Or if you wanted it as an int:

df['foo'] = (np.datetime64(dt.date.today())
             - df['date'].values.astype('datetime64[D]')).astype(int)
print(df)
                  date  foo
0 2014-03-04 17:16:26   56
1 2014-03-10 01:35:56   50
2 2014-03-15 02:35:51   45
3 2014-03-20 05:55:47   40
4 2014-03-26 04:56:33   34
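
The int conversion above relies on the difference having day resolution. Here is a minimal sketch of why the astype('datetime64[D]') step matters, assuming a reasonably recent NumPy. The array name dates and the fixed "today" are made up for illustration so the output is reproducible:

```python
import numpy as np

# Hypothetical sample data: two of the timestamps from the question.
dates = np.array(["2014-03-04T17:16:26", "2014-03-10T01:35:56"],
                 dtype="datetime64[ns]")

# A fixed date stands in for np.datetime64(dt.date.today()); it has unit [D].
today = np.datetime64("2014-04-29")

# Truncating to day resolution first yields timedelta64[D], whose
# underlying integers are day counts.
delta_days = today - dates.astype("datetime64[D]")
print(delta_days.dtype)        # timedelta64[D]
print(delta_days.astype(int))  # [56 50]

# Without the truncation, NumPy promotes the result to the finer unit (ns),
# so astype(int) on it would give nanoseconds rather than days.
delta_ns = today - dates
print(delta_ns.dtype)          # timedelta64[ns]
```

Checking the dtype of the difference before calling astype(int) is a cheap way to make sure the integers mean days.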

Or, if it were the index:

print(np.datetime64(dt.date.today()) - df.index.values.astype('datetime64[D]'))

[56 50 45 40 34]
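
For comparison, the same calculation can probably be written without dropping down to NumPy. This is a sketch rather than part of the original answer: it assumes a pandas version that has the Series .dt accessor and a column that is already a timezone-aware datetime64 column (as in the question). The days_ago column name and the fixed "today" are chosen only for illustration:

```python
import pandas as pd

# Sample data mirroring the question: timezone-aware (UTC) timestamps.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2014-03-04 17:16:26",
        "2014-03-10 01:35:56",
        "2014-03-26 04:56:33",
    ], utc=True)
})

# A fixed "today" keeps the example reproducible; in real use this would be
# something like pd.Timestamp.now(tz="UTC").normalize().
today = pd.Timestamp("2014-04-29", tz="UTC")

# normalize() truncates each timestamp to midnight, so the difference is a
# whole number of days; .dt.days extracts that as an integer column.
df["days_ago"] = (today - df["datetime"].dt.normalize()).dt.days
print(df["days_ago"].tolist())  # [56, 50, 34]
```

Both operands need to agree on timezone awareness: subtracting a naive timestamp from an aware column (or the reverse) raises an error.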

Much later edit: how about this as a workaround?

>>> print(df)

                 date
0 2014-03-04 17:16:26
1 2014-03-10 01:35:56
2 2014-03-15 02:35:51
3 2014-03-20 05:55:47
4 2014-03-26 04:56:33

Try assigning today's date to a column so it gets converted to a datetime64 column by pandas and then do the arithmetic:

>>> df['today'] = dt.date.today()
>>> df['foo'] = (df['today'].values.astype('datetime64[D]')
...              - df['date'].values.astype('datetime64[D]'))
>>> print(df)

                 date       today     foo
0 2014-03-04 17:16:26  2014-05-14 71 days
1 2014-03-10 01:35:56  2014-05-14 65 days
2 2014-03-15 02:35:51  2014-05-14 60 days
3 2014-03-20 05:55:47  2014-05-14 55 days
4 2014-03-26 04:56:33  2014-05-14 49 days
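
Whether a plain datetime.date assigned to a column ends up as datetime64 may depend on the pandas version. As a hedged variant of the workaround, the sketch below makes the conversion explicit with pd.Timestamp; the fixed date and the sample frame are assumptions for reproducibility:

```python
import datetime as dt

import pandas as pd

# Hypothetical sample frame with two of the dates shown above.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-03-04 17:16:26", "2014-03-26 04:56:33"])
})

# Fixed stand-in for dt.date.today(), matching the output above.
fixed_today = dt.date(2014, 5, 14)

# Wrapping the date in pd.Timestamp makes the new column datetime64
# explicitly instead of relying on pandas to infer it.
df["today"] = pd.Timestamp(fixed_today)

# Same arithmetic as the workaround: truncate both sides to days, subtract.
df["foo"] = (df["today"].values.astype("datetime64[D]")
             - df["date"].values.astype("datetime64[D]"))

print(df["today"].dtype.kind)      # 'M' means a datetime64 column
print(df["foo"].dt.days.tolist())  # [71, 49]
```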
Karl D. answered Oct 23 '22 03:10