
Convert list of datestrings to datetime very slow with Python strptime

Tags: python, numpy

I have data files containing lists of strings representing ISO formatted dates. Currently, I am reading them in using:

mydates = [datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in timedata]

This looks straightforward, but it is very slow on large lists: converting one list of ~25000 dates takes about 0.34 seconds. Since I have thousands of such lists, I am looking for a faster way, but I have not found one yet. The dateutil parser performs even worse.

HyperCube asked Jan 04 '13 19:01




2 Answers

Here is a way to do it about 3x faster.

The original version:

In [23]: %timeit datetime.datetime.strptime("2013-01-01T01:23:45", "%Y-%m-%dT%H:%M:%S")
10000 loops, best of 3: 21.8 us per loop

The faster version:

In [24]: p = re.compile('[-T:]')

In [26]: %timeit datetime.datetime(*map(int, p.split("2013-01-01T01:23:45")))
100000 loops, best of 3: 7.28 us per loop

This is obviously nowhere near as flexible as strptime().

edit: Using a single regex to extract the date components is marginally faster:

In [48]: pp = re.compile(r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})')

In [49]: %timeit datetime.datetime(*map(int, pp.match("2013-01-01T01:23:45").groups()))
100000 loops, best of 3: 6.92 us per loop
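
For reference, both regex variants above could be applied to a whole list roughly as follows. This is only a sketch: the helper names `parse_split` and `parse_match` and the sample list are made up for illustration, and it assumes every string has exactly the form YYYY-MM-DDTHH:MM:SS.

```python
import datetime
import re

# Compile the patterns once, outside the loop.
SPLIT_RE = re.compile(r"[-T:]")
MATCH_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})")

def parse_split(s):
    # Split on the separators and pass the six integer fields to datetime().
    return datetime.datetime(*map(int, SPLIT_RE.split(s)))

def parse_match(s):
    # Capture the six fields with one pattern; fail loudly if it does not match.
    m = MATCH_RE.match(s)
    if m is None:
        raise ValueError("not an ISO timestamp: %r" % s)
    return datetime.datetime(*map(int, m.groups()))

# Hypothetical sample data standing in for one list from the question.
timedata = ["2013-01-01T01:23:45", "2013-01-04T19:01:00"]

mydates = [parse_split(s) for s in timedata]
```

Neither helper validates as thoroughly as strptime(), so malformed input may raise a different error or, in the split variant, be accepted in a form strptime() would reject.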
NPE answered Sep 16 '22 12:09


Indexing / slicing seems to be faster than the regex used by @NPE:

In [47]: def with_indexing(dstr):                              
   ....:     return datetime.datetime(*map(int, [dstr[:4], dstr[5:7], dstr[8:10],
   ....:                               dstr[11:13], dstr[14:16], dstr[17:]])) 

In [48]: p = re.compile('[-T:]')

In [49]: def with_regex(dt_str):
   ....:     return datetime.datetime(*map(int, p.split(dt_str)))

In [50]: %timeit with_regex(dstr)
100000 loops, best of 3: 3.84 us per loop

In [51]: %timeit with_indexing(dstr)
100000 loops, best of 3: 2.98 us per loop
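
To reproduce this kind of comparison outside IPython, a script along these lines might work. It is a sketch, not the original session: the sample value of `dstr`, the `with_strptime` helper, and the iteration count are assumptions, and absolute timings will differ by machine and Python version.

```python
import datetime
import timeit

FMT = "%Y-%m-%dT%H:%M:%S"

def with_strptime(dstr):
    # Baseline from the question.
    return datetime.datetime.strptime(dstr, FMT)

def with_indexing(dstr):
    # Fixed-position slices; only valid for the YYYY-MM-DDTHH:MM:SS layout.
    return datetime.datetime(*map(int, [dstr[:4], dstr[5:7], dstr[8:10],
                                        dstr[11:13], dstr[14:16], dstr[17:]]))

dstr = "2013-01-01T01:23:45"  # assumed sample value

# Both approaches should agree before their speed is compared.
same = with_indexing(dstr) == with_strptime(dstr)

N = 20000  # arbitrary number of repetitions
for fn in (with_strptime, with_indexing):
    seconds = timeit.timeit(lambda: fn(dstr), number=N)
    print("%s: %.2f us per call" % (fn.__name__, seconds / N * 1e6))
```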

I think that if you combine a file parser like numpy.genfromtxt, its converters argument, and a fast string-parsing function, you can read and parse a whole file in less than half a second.

I used the following function to create an example file with about 25000 rows, ISO date strings as index and 10 data columns:

import numpy as np
import pandas as pd

def create_data():
    # create dates
    dates = pd.date_range('2010-01-01T00:30', '2013-01-04T23:30', freq='H')
    # convert to iso
    iso_dates = dates.map(lambda x: x.strftime('%Y-%m-%dT%H:%M:%S'))
    # create data
    data = pd.DataFrame(np.random.random((iso_dates.size, 10)) * 100,
                        index=iso_dates)
    # write to file
    data.to_csv('dates.csv', header=False)

Then I used the following code to parse the file:

In [54]: %timeit a = np.genfromtxt('dates.csv', delimiter=',',
                                   converters={0:with_regex})
1 loops, best of 3: 430 ms per loop

In [55]: %timeit a = np.genfromtxt('dates.csv', delimiter=',',
                                   converters={0:with_indexing})
1 loops, best of 3: 391 ms per loop
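
The per-column converter idea can also be illustrated without numpy. The sketch below swaps in the standard-library csv module instead of genfromtxt, uses a small in-memory stand-in for dates.csv, and makes no claim about relative speed; the `converters` dict here is a plain Python dict, not a numpy feature.

```python
import csv
import datetime
import io

def with_indexing(dstr):
    # Same fixed-position slicing as above.
    return datetime.datetime(*map(int, [dstr[:4], dstr[5:7], dstr[8:10],
                                        dstr[11:13], dstr[14:16], dstr[17:]]))

# Hypothetical stand-in for dates.csv: ISO timestamp, then data columns.
sample = io.StringIO(
    "2010-01-01T00:30:00,1.5,2.5\n"
    "2010-01-01T01:30:00,3.0,4.0\n"
)

# Column index -> conversion function; columns not listed fall back to float.
converters = {0: with_indexing}

rows = []
for record in csv.reader(sample):
    rows.append([converters.get(i, float)(field)
                 for i, field in enumerate(record)])
```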

pandas (based on numpy) has a C-based file parser which is even faster:

In [56]: %timeit df = pd.read_csv('dates.csv', header=None, index_col=0, 
                                  parse_dates=True, date_parser=with_indexing)
10 loops, best of 3: 167 ms per loop
bmu answered Sep 20 '22 12:09