
How to determine appropriate strftime format from a date string?

The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.

We are processing files in which each file uses only one date/time format, but the format varies between files. Profiling shows a lot of time being used by dateutil.parser.parse. Since it only needs to be determined once per file, implementing something that isn't guessing the format each time could speed things up.

I don't actually know the formats in advance and I'll still need to infer the format. Something like:

    from datetime import datetime
    from MysteryPackage import date_string_to_format_string

    # e.g. mystring = '1 Jan 2016'
    myformat = None

    ...

    # somewhere in a loop reading from a file or connection:
    if myformat is None:
        myformat = date_string_to_format_string(mystring)

    # do the usual checks to see if that worked, then:
    mydatetime = datetime.strptime(mystring, myformat)

Is there such a function?

Jason asked Jun 02 '17 05:06




1 Answer

This is a tricky one. My approach uses regular expressions and the (?(DEFINE)...) syntax, which Python's built-in re module does not support; it requires the newer third-party regex module.


Essentially, DEFINE lets us define subroutines before matching them, so first of all we define all the bricks needed for our date-guessing function:

    bricks = re.compile(r"""
        (?(DEFINE)
            (?P<year_def>[12]\d{3})
            (?P<year_short_def>\d{2})
            (?P<month_def>January|February|March|April|May|June|
            July|August|September|October|November|December)
            (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
            (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
            (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
            (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
            (?P<hms_def>T?\d{2}:\d{2}:\d{2})
            (?P<hm_def>T?\d{2}:\d{2})
            (?P<ms_def>\d{5,6})
            (?P<delim_def>([-/., ]+|(?<=\d|^)T))
        )
        # actually match them
        (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
        (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
        """, re.VERBOSE)
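If installing regex is not an option, a rough standard-library analogue of the same "define bricks, then match them" idea might look like the sketch below. This is a swapped-in technique, not the approach used in this answer: the bricks live in a plain dict, get joined into one alternation of named groups, and the first group that matches the whole part wins. The names BRICKS, CLASSIFIER and classify are hypothetical, and delimiters are deliberately left out.

```python
import re

# Hypothetical stand-in for the DEFINE block: one pattern per brick.
# Inner groups are non-capturing so that only the brick names are captured.
BRICKS = {
    "hms": r"T?\d{2}:\d{2}:\d{2}",
    "year": r"[12]\d{3}",
    "month": (r"January|February|March|April|May|June|"
              r"July|August|September|October|November|December"),
    "month_short": r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",
    "day": r"0[1-9]|[12][0-9]|3[01]|[1-9]",
    "weekday": r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day",
    "weekday_short": r"Mon|Tue|Wed|Thu|Fri|Sat|Sun",
    "hm": r"T?\d{2}:\d{2}",
    "ms": r"\d{5,6}",
}

# One alternation of named groups, in dict order.
CLASSIFIER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in BRICKS.items())
)


def classify(part):
    """Return the name of the brick that matches the whole part, or None."""
    match = CLASSIFIER.fullmatch(part)
    return match.lastgroup if match else None


for part in ["2006", "August", "Aug", "9", "Thursday", "20:33:10", "16:30"]:
    print(part, "->", classify(part))
```

Because fullmatch has to consume the entire part, a short brick such as month_short cannot accidentally claim the start of a longer month name.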

After this, we need to think of possible delimiters:

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')
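To see what this delimiter pattern does on its own, here is a small sketch using only the standard re module (the pattern needs nothing from regex). Because the pattern is wrapped in a capturing group, split() keeps the delimiters in the result, which is what later lets them be copied into the format string unchanged. The helper name split_parts is made up for illustration.

```python
import re

# Same delimiter pattern as above: runs of -, /, ., comma or space,
# or a literal T that directly follows a digit (the ISO 8601 separator).
delim = re.compile(r'([-/., ]+|(?<=\d)T)')


def split_parts(datestring):
    """Split a date string into tokens, keeping the delimiters."""
    return delim.split(datestring)


print(split_parts('2006-11-02'))
# ['2006', '-', '11', '-', '02']
print(split_parts('Thu, 01 Jan 1970 00:00:00'))
# ['Thu', ', ', '01', ' ', 'Jan', ' ', '1970', ' ', '00:00:00']
print(split_parts('2017-06-07T22:02:05.001811'))
# ['2017', '-', '06', '-', '07', 'T', '22:02:05', '.', '001811']
```

Note that the T in 'Thu' is not treated as a delimiter, since the lookbehind requires a digit in front of it.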

Format mapping:

    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
               'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
               'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
               'delim': ''}
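As a quick illustration of how this mapping is meant to be used, the sketch below assembles a format string from parts that have already been labelled. The labelled list is written by hand here and is purely hypothetical; in the real function the labels come from the regex match.

```python
from datetime import datetime

formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
           'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
           'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
           'delim': ''}

# Hand-labelled parts of '06 Jun 2017 20:33:10' (assumed, for illustration).
labelled = [('day', '06'), ('delim', ' '), ('month_short', 'Jun'),
            ('delim', ' '), ('year', '2017'), ('delim', ' '),
            ('hms', '20:33:10')]

# Delimiters are copied verbatim; everything else becomes its directive.
fmt = ''.join(text if key == 'delim' else formats[key]
              for key, text in labelled)
print(fmt)  # %d %b %Y %H:%M:%S

parsed = datetime.strptime('06 Jun 2017 20:33:10', fmt)
print(parsed)  # 2017-06-06 20:33:10
```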

The function GuessFormat() splits the parts with the help of the delimiters, tries to match them and outputs the corresponding code for strftime():

    def GuessFormat(datestring):

        # define the bricks
        bricks = re.compile(r"""
                (?(DEFINE)
                    (?P<year_def>[12]\d{3})
                    (?P<year_short_def>\d{2})
                    (?P<month_def>January|February|March|April|May|June|
                    July|August|September|October|November|December)
                    (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                    (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                    (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
                    (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                    (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                    (?P<hm_def>T?\d{2}:\d{2})
                    (?P<ms_def>\d{5,6})
                    (?P<delim_def>([-/., ]+|(?<=\d|^)T))
                )
                # actually match them
                (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
                (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
                """, re.VERBOSE)

        # delim
        delim = re.compile(r'([-/., ]+|(?<=\d)T)')

        # formats
        formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
                   'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
                   'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
                   'delim': ''}

        # split() keeps the delimiters because the pattern is a capturing group
        parts = delim.split(datestring)
        out = []
        for index, part in enumerate(parts):
            try:
                # keep only the named groups that actually matched
                brick = dict(filter(lambda x: x[1] is not None, bricks.match(part).groupdict().items()))
                key = next(iter(brick))

                # ambiguities: a day-like number in the second position
                # is treated as a numeric month
                if key == 'day' and index == 2:
                    key = 'month_dec'

                item = part if key == 'delim' else formats[key]
                out.append(item)
            except AttributeError:
                # no brick matched; copy the part through unchanged
                out.append(part)

        return "".join(out)

Finally, a test:

    from datetime import datetime
    import regex as re

    datestrings = [datetime.now().isoformat(), '2006-11-02',
                   'Thursday, 10 August 2006 08:42:51', 'August 9, 1995',
                   'Aug 9, 1995', 'Thu, 01 Jan 1970 00:00:00',
                   '21/11/06 16:30', '06 Jun 2017 20:33:10']

    # test
    for dt in datestrings:
        print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))

This yields:

    Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
    Date: 2006-11-02, Format: %Y-%m-%d
    Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
    Date: August 9, 1995, Format: %B %m, %Y
    Date: Aug 9, 1995, Format: %b %m, %Y
    Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
    Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
    Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S
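Some of the guesses above are not usable as they stand: `%d/%m/%d %H:%M` repeats `%d`, and `%A, %m %B %Y %H:%M:%S` specifies the month twice while leaving out the day. For the per-file use case in the question it therefore seems prudent to check a guess once before reusing it. The following is only a sketch under assumptions: format_is_plausible, parse_all and FIELD_GROUPS are hypothetical names, and guess stands for any callable that returns a format string, such as GuessFormat(). It needs only the standard library.

```python
import re
from datetime import datetime

# Directives that describe the same field; a usable format should contain
# at most one directive from each group (assumed, non-exhaustive list).
FIELD_GROUPS = [{"%d"}, {"%m", "%b", "%B"}, {"%Y", "%y"}, {"%a", "%A"}]


def format_is_plausible(fmt, sample):
    """Reject formats with conflicting directives or that fail to parse."""
    directives = re.findall(r"%[A-Za-z]", fmt)
    for group in FIELD_GROUPS:
        if sum(d in group for d in directives) > 1:
            return False
    try:
        datetime.strptime(sample, fmt)
    except ValueError:
        return False
    return True


def parse_all(strings, guess):
    """Guess the format from the first string, then reuse it for the rest."""
    fmt = None
    for s in strings:
        if fmt is None:
            candidate = guess(s)
            if not format_is_plausible(candidate, s):
                raise ValueError(
                    "unusable format {!r} guessed for {!r}".format(candidate, s))
            fmt = candidate
        yield datetime.strptime(s, fmt)


# Usage with a stand-in guesser that always answers '%Y-%m-%d'.
dates = list(parse_all(['2006-11-02', '2006-11-03'], lambda s: '%Y-%m-%d'))
print(dates)
```

When the check fails, falling back to dateutil.parser.parse for that file would be one option; the sketch simply raises so the problem is visible.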
Jan answered Oct 08 '22 02:10