I have a very large pandas dataframe and I would like to create a column that contains the time in seconds since the epoch for an ISO-8601 format date string. I originally used the standard Python libraries for this, but the result is quite slow. I have tried to replace this by using the POSIX c library functions strptime and mktime directly, but have not been able to get the right answer for the time conversion.
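For context, the conversion I am after can be sketched in pure Python like this (a minimal example, using calendar.timegm so the result does not depend on the local timezone; the epoch value assumes the date is taken as UTC midnight):

```python
import calendar
import time

# Parse the ISO-8601 date string into a struct_time, then convert it
# to seconds since the epoch. calendar.timegm interprets the struct as
# UTC, so the result is timezone-independent (time.mktime would use
# the local timezone instead).
st = time.strptime("2015-05-18", "%Y-%m-%d")
seconds = calendar.timegm(st)
print(seconds)  # 1431907200, i.e. midnight UTC on 2015-05-18
```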
Here is the code (to be run in an IPython window):

%load_ext cythonmagic

%%cython
from posix.types cimport time_t
cimport numpy as np
import numpy as np
import time

cdef extern from "sys/time.h" nogil:
    struct tm:
        int tm_sec
        int tm_min
        int tm_hour
        int tm_mday
        int tm_mon
        int tm_year
        int tm_wday
        int tm_yday
        int tm_isdst
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)

cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)

cdef to_epoch_py(const char *date_text):
    return np.uint32(time.mktime(time.strptime(date_text, "%Y-%m-%d")))

cpdef np.ndarray[unsigned int] apply_epoch_date_c(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(n):
        res[i] = to_epoch_c(col_date[i])
    return res

cpdef np.ndarray[unsigned int] apply_epoch_date_py(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(n):
        res[i] = to_epoch_py(col_date[i])
    return res
The struct populated by strptime does not look right to me: the hours, minutes and seconds values are far too big, and removing them or setting them to 0 does not seem to give the answer I'm looking for.
Here is a small test df which shows the values are not right for the c method:
from pandas import DataFrame
test = DataFrame({'date_text':["2015-05-18" for i in range(3)]}, dtype=np.uint32)
apply_epoch_date_py(test['date_text'].values)
Output: array([1431903600, 1431903600, 1431903600], dtype=uint32)
apply_epoch_date_c(test['date_text'].values)
Output: array([4182545380, 4182617380, 4182602980], dtype=uint32)
I don't understand why the values from the c version are not even consistent between runs and are so far off from what they should be. I hope the mistake is reasonably small, because the time difference between the two on a large dataframe is substantial (although I'm not sure how much work the c version is skipping while it is not working as expected):
test_large = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)
%timeit -n 1 -r 1 apply_epoch_date_py(test_large['date_text'].values)
Output: 1 loops, best of 1: 1min 58s per loop
%timeit apply_epoch_date_c(test_large['date_text'].values)
Output: 1 loops, best of 3: 5.59 s per loop
I have looked up this cython time.h post and a general c unix time from string creation post which may be useful to someone answering.
My main question, therefore, is about the function to_epoch_c: why does it produce incorrect values? Thanks
Update:
The method from @Jeff is indeed the fastest and simplest approach to solving this problem using pandas.
The performance of strptime/mktime in Python is poor compared to the other methods. The other Python-based method mentioned here is much faster. Running the conversion with all the methods mentioned in this post (plus pd.to_datetime with the string format given) provides interesting results. Pandas with infer_datetime_format is easily the fastest, scaling very well. Somewhat unintuitively, if you tell pandas what the date format is, it is much slower.
Profile comparison of both pandas methods:
%prun -l 3 pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
352 function calls (350 primitive calls) in 0.021 seconds
Ordered by: internal time
List reduced from 96 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 0.013 0.013 {pandas.tslib.array_to_datetime}
1 0.005 0.005 0.005 0.005 {pandas.lib.isnullobj}
1 0.001 0.001 0.021 0.021 <string>:1(<module>)
%prun -l 3 pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
109 function calls (107 primitive calls) in 0.253 seconds
Ordered by: internal time
List reduced from 55 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.251 0.251 0.251 0.251 {pandas.tslib.array_strptime}
1 0.001 0.001 0.253 0.253 <string>:1(<module>)
1 0.000 0.000 0.252 0.252 tools.py:176(to_datetime)
It seems that if you don't set time_val.tm_hour, time_val.tm_min and time_val.tm_sec, the date is parsed incorrectly; setting those fields to 0 returns the correct timestamp:
cdef extern from "sys/time.h" nogil:
    struct tm:
        int tm_sec    # Seconds [0,60].
        int tm_min    # Minutes [0,59].
        int tm_hour   # Hour [0,23].
        int tm_mday   # Day of month [1,31].
        int tm_mon    # Month of year [0,11].
        int tm_year   # Years since 1900.
        int tm_wday   # Day of week [0,6] (Sunday = 0).
        int tm_yday   # Day of year [0,365].
        int tm_isdst  # Daylight Savings flag.
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)

cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    time_val.tm_hour, time_val.tm_min, time_val.tm_sec = 0, 0, 0
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)
If you print(time.strptime(date_text, "%Y-%m-%d")) you can see that Python sets the values to 0 when you don't pass them to strptime:

time.struct_time(tm_year=2015, tm_mon=5, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=138, tm_isdst=-1)
Setting the values to a default of 0 in to_epoch_c likewise gives a struct with those fields at 0:

{'tm_sec': 0, 'tm_hour': 0, 'tm_mday': 18, 'tm_isdst': 1, 'tm_year': 115, 'tm_mon': 4, 'tm_yday': 137, 'tm_wday': 1, 'tm_min': 0}
If you don't set them, it returns random timestamps, because tm_sec etc. hold whatever garbage happens to be in the uninitialized struct:
{'tm_sec': -1437999996, 'tm_hour': 0, 'tm_mday': 0, 'tm_isdst': -1438000080, 'tm_year': 32671, 'tm_mon': -1412460224, 'tm_yday': 0, 'tm_wday': 5038405, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 4, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 9, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}
I imagine Python handles the missing fields in a somewhat similar way when you don't pass them in, but I have not looked at the source yet, so perhaps someone more experienced in c can confirm.
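A quick check in plain Python (a small sketch, separate from the Cython code above) shows that time.strptime zero-fills the time-of-day fields that are absent from the format string, which is why the Python path never sees garbage:

```python
import time

# Python's strptime supplies defaults for every struct_time field not
# present in the format string, unlike C's strptime, which leaves
# fields it does not parse untouched.
st = time.strptime("2015-05-18", "%Y-%m-%d")
print(st.tm_hour, st.tm_min, st.tm_sec)  # 0 0 0
```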
If you try to pass fewer than 9 elements to time.struct_time you get an error, which somewhat confirms what I thought:
In [60]: import time
In [61]: struct = time.struct_time((2015, 6, 18))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-61-ee40483c37d4> in <module>()
----> 1 struct = time.struct_time((2015, 6, 18))
TypeError: time.struct_time() takes a 9-sequence (3-sequence given)
You must pass a sequence of 9 elements:
In [63]: struct = time.struct_time((2015, 6, 18, 0, 0, 0, 0, 0, 0))
In [64]: struct
Out[64]: time.struct_time(tm_year=2015, tm_mon=6, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=0, tm_isdst=0)
Anyway, with the changes you get the same behaviour from both:
In [16]: import pandas as pd
In [17]: import numpy as np
In [18]: test = pd.DataFrame({'date_text' : ["2015-05-18" for i in range(3)]}, dtype=np.uint32)
In [19]: apply_epoch_date_c(test['date_text'].values)
Out[19]: array([1431903600, 1431903600, 1431903600], dtype=uint32)
In [20]: apply_epoch_date_py(test['date_text'].values)
Out[20]: array([1431903600, 1431903600, 1431903600], dtype=uint32)
Some tests on every date since 1970-1-1 show that both return the same timestamps:
In [55]: from datetime import datetime, timedelta
In [56]: tests = np.array([(datetime.strptime("1970-1-1","%Y-%m-%d")+timedelta(i)).strftime("%Y-%m-%d") for i in range(16604)])
In [57]: a = apply_epoch_date_c( tests)
In [58]: b = apply_epoch_date_py( tests)
In [59]: for d1, d2 in zip(a, b):
   ....:     assert d1 == d2
   ....:
In [60]:
Timing both implementations shows the cython code is indeed quite a bit more efficient:
In [21]: timeit apply_epoch_date_py(test['date_text'].values)
10000 loops, best of 3: 73 µs per loop
In [22]: timeit apply_epoch_date_c(test['date_text'].values)
100000 loops, best of 3: 10.8 µs per loop
Here is a simple pure-pandas method of doing this. Dates are stored natively as i8 (in ns since the epoch).
In [30]: df = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)
In [31]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
date_text object
dtypes: object(1)
memory usage: 152.6+ MB
In [32]: pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
Out[32]:
array([1431907200, 1431907200, 1431907200, ..., 1431907200, 1431907200,
1431907200])
In [33]: %timeit pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
1 loops, best of 3: 1.96 s per loop
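Note that in recent pandas releases the box= argument to pd.to_datetime has been removed; an equivalent sketch for newer versions (assuming the date strings are naive and treated as UTC midnight) is:

```python
import numpy as np
import pandas as pd

s = pd.Series(["2015-05-18"] * 3, name="date_text")
# datetime64[ns] values are i8 nanoseconds since the epoch under the
# hood; casting to int64 and integer-dividing by 10**9 yields whole
# seconds without any per-element Python overhead.
secs = pd.to_datetime(s, format="%Y-%m-%d").values.astype("int64") // 10**9
print(secs)  # each element == 1431907200
```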