I am trying to do the following but is seems that vectorized operations in this mode are not supported.
import pandas as pd
df=pd.DataFrame([[2017,1,15,1],
[2017,1,15,2],
[2017,1,15,3],
[2017,1,15,4],
[2017,1,15,5],
[2017,1,15,6],
[2017,1,15,7]],
columns=['year','month','day','month_offset'])
df['date']=df.apply(lambda g: pd.datetime(g.year,g.month,g.day),axis=1)
df['offset']=df.apply(lambda g: pd.offsets.MonthEnd(g.month_offset),axis=1)
df['date_offset']=df.date+df.offset
This is the warning returned for last statement in the code snippet:
C:\Python3.5.2.3\WinPython-64bit-3.5.2.3\python-3.5.2.amd64\lib\site-packages\pandas\core\ops.py:533: PerformanceWarning: Adding/subtracting array of DateOffsets to Series not vectorized "Series not vectorized", PerformanceWarning)
I would like to this to work as a vectorized operation because of the performance benefits.
Thanks.
To end, comparison of methods following on from @john-zwinck:
import time
import pandas as pd
import numpy as np
df=pd.DataFrame([[2017,1,1,1],
[2017,1,1,2],
[2017,1,1,3],
[2017,1,1,4],
[2017,1,1,5],
[2017,1,1,6],
[2017,1,1,7]],
columns=['year','month','day','month_offset'])
df['mydate']=df.apply(lambda g:
pd.datetime(g.year,g.month,g.day),axis=1)
start_time=time.time()
df['pandas_offset']=df.apply(lambda g: g.mydate +
pd.offsets.MonthEnd(g.month_offset),axis=1)
end_time=time.time()
print('Method1 {} seconds'.format(end_time-start_time))
start_time=time.time()
df['numpy_offset']=(df.mydate.values.astype('M8[M]')+
df.month_offset.values * np.timedelta64(1, 'M')).astype('M8[D]') -
np.timedelta64(1, 'D')
end_time=time.time()
print('Method3 with numpy vectorization {} seconds'.format(end_time-
start_time))
The result:
index year month day month_offset mydate offset1 final
0 2017 1 1 1 2017-01-01 2017-01-31 2017-01-31
1 2017 1 1 2 2017-01-01 2017-02-28 2017-02-28
2 2017 1 1 3 2017-01-01 2017-03-31 2017-03-31
3 2017 1 1 4 2017-01-01 2017-04-30 2017-04-30
4 2017 1 1 5 2017-01-01 2017-05-31 2017-05-31
5 2017 1 1 6 2017-01-01 2017-06-30 2017-06-30
6 2017 1 1 7 2017-01-01 2017-07-31 2017-07-31
runfile('C:/bitbucket/test/vector_dates.py', wdir='C:/bitbucket/test')
Method 1 0.003999948501586914 seconds
Method 2 with numpy vectorization 0.0009999275207519531 seconds
Clearly numpy much faster
A truly vectorized way to do this is to construct an array of numpy.timedelta64
from month_offset
, add this to the array of dates, then subtract numpy.timedelta64(1, 'D')
to go back to the last day of the previous month.
Solutions using apply(lambda)
are likely to be much slower. And as the warning said, some Pandas date offset operations are not vectorized. If your data are large, it's better to avoid them. The NumPy facilities like busday_offset()
and timedelta64
are fully performant.
Consider the following approach:
In [94]: df['date'] = pd.to_datetime(df[['year','month','day']])
In [95]: df['date_offset'] = df.apply(lambda x: x['date'] + pd.offsets.MonthEnd(x['month_offset']), axis=1)
In [96]: df
Out[96]:
year month day month_offset date date_offset
0 2017 1 15 1 2017-01-15 2017-01-31
1 2017 1 15 2 2017-01-15 2017-02-28
2 2017 1 15 3 2017-01-15 2017-03-31
3 2017 1 15 4 2017-01-15 2017-04-30
4 2017 1 15 5 2017-01-15 2017-05-31
5 2017 1 15 6 2017-01-15 2017-06-30
6 2017 1 15 7 2017-01-15 2017-07-31
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With