
Converting string date to epoch time not working with Cython and POSIX C libraries

I have a very large pandas DataFrame and I would like to create a column that contains the time in seconds since the epoch for an ISO-8601 format date string.

I originally used the standard Python libraries for this, but the result is quite slow. I have tried to replace them with the POSIX C library functions strptime and mktime directly, but have not been able to get the right answer for the time conversion.

Here is the code (to be run in an IPython window)

%load_ext cythonmagic

%%cython
from posix.types cimport time_t
cimport numpy as np
import numpy as np
import time
cdef extern from "sys/time.h" nogil:
    struct tm:
        int tm_sec
        int tm_min
        int tm_hour
        int tm_mday
        int tm_mon
        int tm_year
        int tm_wday
        int tm_yday
        int tm_isdst
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)
cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)
cdef to_epoch_py(const char *date_text):
    return np.uint32(time.mktime(time.strptime(date_text, "%Y-%m-%d")))
cpdef np.ndarray[unsigned int] apply_epoch_date_c(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(len(col_date)):
        res[i] = to_epoch_c(col_date[i])
    return res
cpdef np.ndarray[unsigned int] apply_epoch_date_py(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(len(col_date)):
        res[i] = to_epoch_py(col_date[i])
    return res

The struct filled in by strptime does not look right to me: the hours, minutes and seconds values are way too big, and removing them or setting them to 0 does not seem to give the answer I'm looking for.

Here is a small test df which shows the values are not right for the C method:

from pandas import DataFrame
test = DataFrame({'date_text':["2015-05-18" for i in range(3)]}, dtype=np.uint32)

apply_epoch_date_py(test['date_text'].values)
Output: array([1431903600, 1431903600, 1431903600], dtype=uint32)
apply_epoch_date_c(test['date_text'].values)
Output: array([4182545380, 4182617380, 4182602980], dtype=uint32)

I don't get why the values for the C version are not always the same and are so far off what they should be. I hope the mistake is reasonably small, because the time difference between these two on a large DataFrame is substantial (I'm not sure how much less work the C version is doing now, as it is not working as expected).

test_large = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)
%timeit -n 1 -r 1 apply_epoch_date_py(test_large['date_text'].values)
Output: 1 loops, best of 1: 1min 58s per loop
%timeit apply_epoch_date_c(test_large['date_text'].values)
Output: 1 loops, best of 3: 5.59 s per loop

I have looked up this Cython time.h post and a general C unix-time-from-string post, which may be useful to someone answering.

My main question, therefore, is about the function to_epoch_c: why is this function producing incorrect values? Thanks.

Update:

The method from @Jeff is indeed the fastest and simplest approach to solving this problem using pandas.

The performance of strptime/mktime in Python is poor in comparison to the other methods. The other Python-based method mentioned here is much faster. Running the conversion with all the methods mentioned in this post (plus pd.to_datetime with the string format given) produces interesting results: pandas with infer_datetime_format is easily the fastest and scales very well. Somewhat unintuitively, if you tell pandas what the date format is, it is much slower.
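One more option, not benchmarked in the original post, is worth noting: NumPy can parse ISO-8601 date strings natively, and a datetime64[s] value is exactly seconds since the epoch (in UTC, so it differs from the local-time mktime results above by the UTC offset). A minimal sketch:

```python
import numpy as np

# NumPy parses ISO-8601 strings directly; datetime64[s] counts seconds since the epoch (UTC)
dates = np.array(["2015-05-18"] * 3)
epoch = dates.astype("datetime64[s]").astype(np.int64)
print(epoch)  # [1431907200 1431907200 1431907200]
```

This avoids any per-element Python call, which is why it scales well on large arrays.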

Performance Comparison

Profile comparison of both pandas methods:

%prun -l 3 pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
352 function calls (350 primitive calls) in 0.021 seconds
Ordered by: internal time
List reduced from 96 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.013    0.013    0.013    0.013 {pandas.tslib.array_to_datetime}
    1    0.005    0.005    0.005    0.005 {pandas.lib.isnullobj}
    1    0.001    0.001    0.021    0.021 <string>:1(<module>)

%prun -l 3 pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
109 function calls (107 primitive calls) in 0.253 seconds

Ordered by: internal time
List reduced from 55 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.251    0.251    0.251    0.251 {pandas.tslib.array_strptime}
    1    0.001    0.001    0.253    0.253 <string>:1(<module>)
    1    0.000    0.000    0.252    0.252 tools.py:176(to_datetime)
asked May 19 '15 by dsimmie


2 Answers

It seems that if you don't set time_val.tm_hour, time_val.tm_min and time_val.tm_sec, the date is parsed incorrectly; setting those values to 0 before calling strptime returns the correct timestamp:

cdef extern from "sys/time.h" nogil:
    struct tm:
        int    tm_sec   #Seconds [0,60].
        int    tm_min   #Minutes [0,59].
        int    tm_hour  #Hour [0,23].
        int    tm_mday  #Day of month [1,31].
        int    tm_mon   #Month of year [0,11].
        int    tm_year  #Years since 1900.
        int    tm_wday  #Day of week [0,6] (Sunday =0).
        int    tm_yday  #Day of year [0,365].
        int    tm_isdst #Daylight Savings
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)
cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    # strptime only fills the fields named in the format string,
    # so default the unparsed fields before the call
    time_val.tm_hour, time_val.tm_min, time_val.tm_sec = 0, 0, 0
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)

If you print(time.strptime(date_text, "%Y-%m-%d")) you can see Python sets the missing values to 0 when they are not part of the format:

 time.struct_time(tm_year=2015, tm_mon=5, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=138, tm_isdst=-1)

With the values defaulted to 0 in to_epoch_c, the parsed struct comes out as expected:

{'tm_sec': 0, 'tm_hour': 0, 'tm_mday': 18, 'tm_isdst': 1, 'tm_year': 115, 'tm_mon': 4, 'tm_yday': 137, 'tm_wday': 1, 'tm_min': 0}

If you don't set them, it returns seemingly random timestamps, because tm_sec etc. hold whatever garbage happened to be on the stack:

 {'tm_sec': -1437999996, 'tm_hour': 0, 'tm_mday': 0, 'tm_isdst': -1438000080, 'tm_year': 32671, 'tm_mon': -1412460224, 'tm_yday': 0, 'tm_wday': 5038405, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 4, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 9, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}

I imagine Python handles the missing fields in a somewhat similar way when you don't pass them in, but I have not looked at the source yet, so maybe someone more experienced in C can confirm.
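This is easy to confirm from the interpreter: CPython's strptime initializes every field itself, filling anything missing from the format with defaults rather than leaving it undefined:

```python
import time

# fields absent from "%Y-%m-%d" are defaulted, not left as garbage
t = time.strptime("2015-05-18", "%Y-%m-%d")
print(t.tm_hour, t.tm_min, t.tm_sec)  # 0 0 0
```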

If you try to pass fewer than 9 elements to time.struct_time you get an error, which somewhat confirms what I thought:

In [60]: import time  
In [61]: struct = time.struct_time((2015, 6, 18))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-61-ee40483c37d4> in <module>()
----> 1 struct = time.struct_time((2015, 6, 18))

TypeError: time.struct_time() takes a 9-sequence (3-sequence given)

You must pass a sequence of 9 elements:

In [63]: struct = time.struct_time((2015, 6, 18, 0, 0, 0, 0, 0, 0))    
In [64]: struct
Out[64]: time.struct_time(tm_year=2015, tm_mon=6, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=0, tm_isdst=0)

Anyway, with the changes you get the same behaviour from both:

In [16]: import pandas as pd

In [17]: import numpy as np

In [18]: test = pd.DataFrame({'date_text' : ["2015-05-18" for i in range(3)]}, dtype=np.uint32)

In [19]: apply_epoch_date_c(test['date_text'].values)
Out[19]: array([1431903600, 1431903600, 1431903600], dtype=uint32)

In [20]: apply_epoch_date_py(test['date_text'].values)
Out[20]: array([1431903600, 1431903600, 1431903600], dtype=uint32)

Some tests on every date since 1970-01-01 show both return the same timestamps:

In [55]: from datetime import datetime, timedelta

In [56]: tests = np.array([(datetime.strptime("1970-1-1","%Y-%m-%d")+timedelta(i)).strftime("%Y-%m-%d") for i in range(16604)])

In [57]: a = apply_epoch_date_c( tests)

In [58]: b = apply_epoch_date_py( tests)

In [59]: for d1, d2 in zip(a, b):
             assert d1 == d2
   ....:

In [60]: 

Timing both implementations, the Cython code does indeed seem to be quite a bit more efficient:

In [21]: timeit apply_epoch_date_py(test['date_text'].values)
10000 loops, best of 3: 73 µs per loop

In [22]: timeit apply_epoch_date_c(test['date_text'].values)
100000 loops, best of 3: 10.8 µs per loop
answered Sep 30 '22 by Padraic Cunningham


Here is a simple pure-pandas method of doing this: dates are stored natively as i8 (nanoseconds since the epoch).

In [30]: df = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)

In [31]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
date_text    object
dtypes: object(1)
memory usage: 152.6+ MB

In [32]: pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
Out[32]: 
array([1431907200, 1431907200, 1431907200, ..., 1431907200, 1431907200,
       1431907200])

In [33]: %timeit pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
1 loops, best of 3: 1.96 s per loop
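For readers on current pandas: the box= keyword used above was later deprecated and removed from pd.to_datetime, but the same conversion can be sketched without it (a sketch, assuming pandas >= 1.0):

```python
import pandas as pd

s = pd.Series(["2015-05-18"] * 3)
# datetime64[ns] -> int64 nanoseconds -> integer seconds since the epoch
epoch = pd.to_datetime(s, format="%Y-%m-%d").to_numpy().astype("int64") // 10**9
print(epoch)  # [1431907200 1431907200 1431907200]
```

Using floor division by 10**9 also keeps the result an integer array, where the original `/10**9` produced floats.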
answered Sep 30 '22 by Jeff