I have data files containing lists of strings representing ISO formatted dates. Currently, I am reading them in using:
mydates = [datetime.datetime.strptime(timedata[x], "%Y-%m-%dT%H:%M:%S") for x in range(len(timedata))]
This looks quite straightforward, but it is ridiculously slow when operating on huge lists of ~25,000 dates: about 0.34 seconds per converted list. Since I have thousands of such lists, I am looking for a faster way, but I have not found one yet. The dateutil parser performs even worse...
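(For reference, a self-contained version of that baseline; the sample list here is made up for illustration.)
import datetime

timedata = ["2013-01-01T01:23:45"] * 25000  # hypothetical sample data
mydates = [datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in timedata]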
strptime is short for "parse time" (from a string), where strftime is for "formatting time". That is, strptime is the inverse of strftime, and the two conveniently share the same format directives. A strptime() function is available in both the datetime and time modules, parsing a string into a datetime object or a struct_time respectively.
For time.strptime(), the format parameter defaults to "%a %b %d %H:%M:%S %Y", which matches the formatting returned by ctime(). In both modules, strptime() raises ValueError if the string cannot be parsed according to the format, or if it has excess data after parsing.
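A quick round trip illustrates the pairing (a minimal sketch):
import datetime

d = datetime.datetime(2013, 1, 1, 1, 23, 45)
s = d.strftime("%Y-%m-%dT%H:%M:%S")          # datetime -> "2013-01-01T01:23:45"
d2 = datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S")  # string -> datetime
assert d2 == d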
Here is a way to do it about 3x faster.
The original version:
In [23]: %timeit datetime.datetime.strptime("2013-01-01T01:23:45", "%Y-%m-%dT%H:%M:%S")
10000 loops, best of 3: 21.8 us per loop
The faster version:
In [24]: p = re.compile('[-T:]')
In [26]: %timeit datetime.datetime(*map(int, p.split("2013-01-01T01:23:45")))
100000 loops, best of 3: 7.28 us per loop
This is obviously nowhere near as flexible as strptime().
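(Packaged as a self-contained function for reuse; this is just the answer's approach with the imports spelled out.)
import datetime
import re

_split = re.compile(r'[-T:]')

def parse_iso(s):
    # '2013-01-01T01:23:45' -> datetime.datetime(2013, 1, 1, 1, 23, 45)
    return datetime.datetime(*map(int, _split.split(s)))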
edit: Using a single regex to extract the date components is marginally faster:
In [48]: pp = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})')
In [49]: %timeit datetime.datetime(*map(int, pp.match("2013-01-01T01:23:45").groups()))
100000 loops, best of 3: 6.92 us per loop
Indexing / slicing seems to be faster than the regex used by @NPE:
In [46]: dstr = "2013-01-01T01:23:45"
In [47]: def with_indexing(dstr):
   ....:     return datetime.datetime(*map(int, [dstr[:4], dstr[5:7], dstr[8:10],
   ....:                                         dstr[11:13], dstr[14:16], dstr[17:]]))
In [48]: p = re.compile('[-T:]')
In [49]: def with_regex(dt_str):
   ....:     return datetime.datetime(*map(int, p.split(dt_str)))
In [50]: %timeit with_regex(dstr)
100000 loops, best of 3: 3.84 us per loop
In [51]: %timeit with_indexing(dstr)
100000 loops, best of 3: 2.98 us per loop
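Note that the slicing version assumes zero-padded, fixed-width ISO strings; a malformed input surfaces as a ValueError from int(). Applied to a whole list as in the question (sample data made up):
dates = ["2013-01-01T01:23:45"] * 25000  # hypothetical input list
parsed = [with_indexing(s) for s in dates]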
I think if you use a file parser like numpy.genfromtxt with the converters argument and a fast string-parsing method, you can read and parse a whole file in less than half a second.
I used the following function to create an example file with about 25,000 rows, ISO date strings as the index, and 10 data columns:
import numpy as np
import pandas as pd

def create_data():
    # create hourly dates
    dates = pd.date_range('2010-01-01T00:30', '2013-01-04T23:30', freq='H')
    # convert to ISO date strings
    iso_dates = dates.map(lambda x: x.strftime('%Y-%m-%dT%H:%M:%S'))
    # create 10 columns of random data
    data = pd.DataFrame(np.random.random((iso_dates.size, 10)) * 100,
                        index=iso_dates)
    # write to file
    data.to_csv('dates.csv', header=False)
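Calling the function once writes dates.csv to the working directory:
create_data()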
Then I used the following code to parse the file:
In [54]: %timeit a = np.genfromtxt('dates.csv', delimiter=',', converters={0: with_regex})
1 loops, best of 3: 430 ms per loop
In [55]: %timeit a = np.genfromtxt('dates.csv', delimiter=',', converters={0: with_indexing})
1 loops, best of 3: 391 ms per loop
pandas (based on numpy) has a C-based file parser which is even faster:
In [56]: %timeit df = pd.read_csv('dates.csv', header=None, index_col=0, parse_dates=True, date_parser=with_indexing)
10 loops, best of 3: 167 ms per loop
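(A self-contained sketch of that call, assuming dates.csv from create_data() and with_indexing from above; date_parser is applied to the index column here.)
import pandas as pd

df = pd.read_csv('dates.csv', header=None, index_col=0,
                 parse_dates=True, date_parser=with_indexing)
# the index is now a DatetimeIndex, built by the fast parsing function
print(type(df.index))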