I have enormous files that look like this:
05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25
I can easily read them using the following:
pd.read_csv(gzip.open("myfile.gz"), header=None,
            names=["date", "time", "price", "size", "type", "zero", "empty", "last"],
            parse_dates=[[0, 1]])
Is there any way to efficiently parse dates like this into pandas Timestamps? If not, is there any guide for writing a Cython function that can be passed to date_parser=?
I tried writing my own parser function and it still takes too long for the project I am working on.
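For reference, a custom parser hooks into read_csv roughly like this (a minimal sketch of the date_parser interface: with parse_dates=[[0,1]] pandas passes the two raw columns to the callable; the function name below is illustrative, not from the question):

def parse_date_time(date_col, time_col):
    # one vectorized to_datetime call over the concatenated strings
    return pd.to_datetime(date_col + " " + time_col,
                          format="%m/%d/%Y %H:%M:%S.%f")

pd.read_csv(gzip.open("myfile.gz"), header=None,
            names=["date", "time", "price", "size", "type", "zero", "empty", "last"],
            parse_dates=[[0, 1]], date_parser=parse_date_time)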
An improvement on the previous solution of Michael WS:
- conversion to pandas.Timestamp is better performed outside the Cython code
- atoi and processing native C strings is a little faster than Python functions
- the number of datetime library calls is reduced from two to one (plus an occasional one for the date)
NB! The date order in this code is day/month/year.
All in all, the code seems to be approximately 10 times faster than the original convert_date_cython. However, if it is called after read_csv, then on an SSD the difference in total time is only a few percent due to the reading overhead. I would guess that on a regular HDD the difference would be even smaller.
cimport numpy as np
import datetime
import numpy as np
import pandas as pd
from libc.stdlib cimport atoi, malloc, free
from libc.string cimport strcpy

### Modified code from Michael WS:
### https://stackoverflow.com/a/15812787/2447082

def convert_date_fast(np.ndarray date_vec, np.ndarray time_vec):
    cdef int i, d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms
    cdef int N = len(date_vec)
    cdef np.ndarray out_ar = np.empty(N, dtype=object)
    cdef bytes prev_date = b'xx/xx/xxxx'
    cdef char *date_str = <char *> malloc(20)
    cdef char *time_str = <char *> malloc(20)
    for i in range(N):
        if date_vec[i] != prev_date:
            # re-parse the date only when it changes between rows
            prev_date = date_vec[i]
            strcpy(date_str, prev_date)  ### xx/xx/xxxx
            # NUL-terminate each field in place so atoi reads one number
            date_str[2] = 0
            date_str[5] = 0
            d_year = atoi(date_str + 6)
            d_month = atoi(date_str + 3)
            d_day = atoi(date_str)
        strcpy(time_str, time_vec[i])  ### xx:xx:xx.xxxxxx
        time_str[2] = 0
        time_str[5] = 0
        time_str[8] = 0
        t_hour = atoi(time_str)
        t_min = atoi(time_str + 3)
        t_sec = atoi(time_str + 6)
        t_ms = atoi(time_str + 9)  # fractional part, taken as microseconds
        out_ar[i] = datetime.datetime(d_year, d_month, d_day,
                                      t_hour, t_min, t_sec, t_ms)
    free(date_str)
    free(time_str)
    return pd.to_datetime(out_ar)
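A possible way to call it (a sketch, not from the original answer: it assumes the .pyx module has been compiled, e.g. via pyximport or a setup.py, and that the columns are converted to bytes, because the routine relies on C string functions):

import gzip
import pandas as pd

df = pd.read_csv(gzip.open("myfile.gz"), header=None,
                 names=["date", "time", "price", "size", "type", "zero", "empty", "last"])
# strcpy needs C strings, so hand over bytes, not str
timestamps = convert_date_fast(df["date"].values.astype(bytes),
                               df["time"].values.astype(bytes))
df.index = timestamps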
I got an incredible speedup (50x) with the following Cython code.
Call from Python: timestamps = convert_date_cython(df["date"].values, df["time"].values)
cimport numpy as np
import pandas as pd
import datetime
import numpy as np

def convert_date_cython(np.ndarray date_vec, np.ndarray time_vec):
    cdef int i
    cdef int N = len(date_vec)
    cdef out_ar = np.empty(N, dtype=object)
    date = None
    for i in range(N):
        if date is None or date_vec[i] != date_vec[i - 1]:
            # re-parse the date only when it differs from the previous row
            dt_ar = list(map(int, date_vec[i].split("/")))
            date = datetime.date(dt_ar[2], dt_ar[0], dt_ar[1])
        # the fractional seconds are discarded by split(".")[0]
        time_ar = list(map(int, time_vec[i].split(".")[0].split(":")))
        time = datetime.time(time_ar[0], time_ar[1], time_ar[2])
        out_ar[i] = pd.Timestamp(datetime.datetime.combine(date, time))
    return out_ar
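The key trick in both versions is caching: consecutive rows usually share a date, so it is parsed once and reused. For comparison, a rough pure-pandas sketch of the same idea (no Cython; the function name is illustrative, and it assumes the time column is in an HH:MM:SS.fff form that to_timedelta accepts):

import pandas as pd

def convert_date_mapped(date_ser, time_ser):
    # parse each distinct date string once instead of once per row
    date_map = {s: pd.Timestamp(s) for s in date_ser.unique()}
    dates = pd.to_datetime(date_ser.map(date_map))
    return dates + pd.to_timedelta(time_ser)

timestamps = convert_date_mapped(df["date"], df["time"])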