I have a column with IDs and the time is encoded within. For example:
0 020160910223200_T1
1 020160910223200_T1
2 020160910223203_T1
3 020160910223203_T1
4 020160910223206_T1
5 020160910223206_T1
6 020160910223209_T1
7 020160910223209_T1
8 020160910223213_T1
9 020160910223213_T1
If we remove the first character and the last three characters, we obtain 20160910223200 for the first row, which should be converted to 2016-09-10 22:32:00.
My solution was to write a function which truncates the IDs and transforms to a datetime. Then, I applied this function to my df column.
from datetime import datetime

def MeasureIDtoTime(MeasureID):
    # Drop the leading character and the trailing '_T1' suffix.
    # A slice of [1:14] would keep only 13 digits and lose the last second digit.
    MeasureID = str(MeasureID)[1:-3]
    return datetime.strptime(MeasureID, '%Y%m%d%H%M%S')

df['Time'] = df['MeasureID'].apply(MeasureIDtoTime)
This works properly, but it is too slow for my case: I have to deal with more than 20 million rows. Is there a more efficient solution?
Update
According to @MaxU there is a better solution:
pd.to_datetime(df.ID.str[1:-3], format = '%Y%m%d%H%M%S')
This does the job in 32 seconds for 7.2 million rows. However, in R, thanks to the lubridate::ymd_hms() function, I performed the task in less than 2 seconds. So I am wondering whether there is a faster solution for my problem in Python.
UPDATE: performance optimization...
Let's try to optimize it a little bit.
DF shape: 50,000 x 1
In [220]: df.head()
Out[220]:
ID
0 020160910223200_T1
1 020160910223200_T1
2 020160910223203_T1
3 020160910223203_T1
4 020160910223206_T1
In [221]: df.shape
Out[221]: (50000, 1)
In [222]: len(df)
Out[222]: 50000
Timing:
In [223]: %timeit df['ID'].apply(MeasureIDtoTime)
1 loop, best of 3: 929 ms per loop
In [224]: %timeit pd.to_datetime(df.ID.str[1:-3])
1 loop, best of 3: 5.68 s per loop
In [225]: %timeit pd.to_datetime(df.ID.str[1:-3], format='%Y%m%d%H%M%S')
1 loop, best of 3: 267 ms per loop ### WINNER !
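Conclusion: explicitly specifying the datetime format makes pd.to_datetime about 21 times faster (267 ms vs. 5.68 s), and about 3.5 times faster than the apply() approach (929 ms).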
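To re-run this comparison on your own machine, something like the following sketch may help. It builds synthetic IDs in the same layout as the question (the data, the constant FMT, and the helper names with_apply and with_format are made up for illustration) and times the apply() approach against pd.to_datetime with an explicit format. Absolute numbers will differ by pandas version and hardware.

```python
import timeit
from datetime import datetime

import pandas as pd

FMT = "%Y%m%d%H%M%S"

# Synthetic data (an assumption, not the original dataset):
# 50,000 IDs shaped like '0' + YYYYmmddHHMMSS + '_T1'.
stamps = pd.date_range("2016-09-10 22:32:00", periods=50_000, freq="s")
df = pd.DataFrame({"ID": ["0" + ts.strftime(FMT) + "_T1" for ts in stamps]})


def measure_id_to_time(measure_id):
    # Row-by-row parsing, as in the question.
    return datetime.strptime(str(measure_id)[1:-3], FMT)


def with_apply():
    return df["ID"].apply(measure_id_to_time)


def with_format():
    # Vectorized slicing plus an explicit format string.
    return pd.to_datetime(df["ID"].str[1:-3], format=FMT)


t_apply = timeit.timeit(with_apply, number=1)
t_format = timeit.timeit(with_format, number=1)
print(f"apply: {t_apply:.3f} s, to_datetime with format: {t_format:.3f} s")
```

Both functions should return the same timestamps; only the run time differs.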
NOTE: this is only possible if all values share the same datetime format.
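If a few IDs might not follow the format, one option is to pass errors='coerce' so that they become NaT instead of raising an error. A minimal sketch, assuming a small made-up frame with one malformed ID:

```python
import pandas as pd

# Made-up sample: two well-formed IDs and one malformed ID.
df = pd.DataFrame(
    {"ID": ["020160910223200_T1", "020160910223203_T1", "0BADTIMESTAMP00_T1"]}
)

# Values that do not match the format are turned into NaT.
parsed = pd.to_datetime(
    df["ID"].str[1:-3], format="%Y%m%d%H%M%S", errors="coerce"
)

# Rows that failed to parse, for inspection.
bad_rows = df[parsed.isna()]
print(bad_rows)
```

This keeps the fast explicit-format path for the bulk of the data and lets you look at the offending rows separately.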
OLD answer:
In [81]: pd.to_datetime(df.ID.str[1:-3])
Out[81]:
0 2016-09-10 22:32:00
1 2016-09-10 22:32:00
2 2016-09-10 22:32:03
3 2016-09-10 22:32:03
4 2016-09-10 22:32:06
5 2016-09-10 22:32:06
6 2016-09-10 22:32:09
7 2016-09-10 22:32:09
8 2016-09-10 22:32:13
9 2016-09-10 22:32:13
Name: ID, dtype: datetime64[ns]
where df is:
In [82]: df
Out[82]:
ID
0 020160910223200_T1
1 020160910223200_T1
2 020160910223203_T1
3 020160910223203_T1
4 020160910223206_T1
5 020160910223206_T1
6 020160910223209_T1
7 020160910223209_T1
8 020160910223213_T1
9 020160910223213_T1
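For anyone who wants to reproduce the example end to end, here is a self-contained sketch that rebuilds the ten-row frame shown above and adds the parsed column. The column name Time is taken from the question; everything else mirrors the sample data.

```python
import pandas as pd

# The ten sample IDs from the example frame.
ids = [
    "020160910223200_T1", "020160910223200_T1",
    "020160910223203_T1", "020160910223203_T1",
    "020160910223206_T1", "020160910223206_T1",
    "020160910223209_T1", "020160910223209_T1",
    "020160910223213_T1", "020160910223213_T1",
]
df = pd.DataFrame({"ID": ids})

# Strip the first character and the last three, then parse with an explicit format.
df["Time"] = pd.to_datetime(df["ID"].str[1:-3], format="%Y%m%d%H%M%S")
print(df)
```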