The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.
We are processing files in which each file uses only one date/time format, but the format varies between files. Profiling shows a lot of time being used by dateutil.parser.parse. Since the format only needs to be determined once per file, implementing something that isn't guessing the format each time could speed things up.
I don't actually know the formats in advance and I'll still need to infer the format. Something like:
    from MysteryPackage import date_string_to_format_string
    from datetime import datetime

    # e.g.
    mystring = '1 Jan 2016'
    myformat = None

    ...

    # somewhere in a loop reading from a file or connection:
    if myformat is None:
        myformat = date_string_to_format_string(mystring)

    # do the usual checks to see if that worked, then:
    mydatetime = datetime.strptime(mystring, myformat)
Is there such a function?
The strftime() method takes a format string containing one or more format codes and returns a string formatted accordingly. To use it, import the datetime class from the datetime module, since strftime() is a method of datetime objects; for example, a datetime object holding the current date and time can be stored in a variable called now.
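As a small illustration of the description above, the following sketch formats a fixed datetime and then the current time. The variable names and the chosen format strings are only for illustration.

```python
from datetime import datetime

# A fixed datetime keeps the example reproducible.
fixed = datetime(2016, 1, 1, 14, 30, 5)

# strftime() takes one format string, which may contain several format codes.
iso_like = fixed.strftime("%Y-%m-%d %H:%M:%S")
date_only = fixed.strftime("%d/%m/%Y")

print(iso_like)   # 2016-01-01 14:30:05
print(date_only)  # 01/01/2016

# The same method works on the current date and time.
now = datetime.now()
print(now.strftime("%Y-%m-%d %H:%M:%S"))
```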
Method #1: Using strptime(). The strptime() function is normally used to convert a date string to a datetime object. Because it raises ValueError when the string does not match the format or is not a valid date, it can also be used to check validity.
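A minimal sketch of that validity check, extended to pick the first format that fits from a list. The helper names matches_format and first_matching_format and the CANDIDATES list are hypothetical, and the approach assumes the plausible formats can be enumerated up front.

```python
from datetime import datetime


def matches_format(date_string, fmt):
    """Return True if date_string parses under fmt, False otherwise."""
    try:
        datetime.strptime(date_string, fmt)
    except ValueError:
        # Raised for a format mismatch and for impossible dates alike.
        return False
    return True


def first_matching_format(date_string, candidates):
    """Return the first candidate format that parses date_string, or None."""
    for fmt in candidates:
        if matches_format(date_string, fmt):
            return fmt
    return None


# Hypothetical candidate list; extend it with whatever formats are expected.
CANDIDATES = ["%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]

print(matches_format("2016-02-30", "%Y-%m-%d"))         # False: no 30 February
print(first_matching_format("25/12/2016", CANDIDATES))  # %d/%m/%Y
```

The format found for the first line of a file can then be reused for every remaining line, so the search runs only once per file.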
The string format should be: YYYY-MM-DDTHH:mm:ss.sssZ, where YYYY-MM-DD is the date: year-month-day.
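In Python, a string of that shape can be read with strptime(). This is only a sketch, assuming the string always carries fractional seconds and a literal trailing Z; the name ISO_UTC_FORMAT is made up for the example.

```python
from datetime import datetime, timezone

# %f accepts one to six fractional digits; the trailing Z is matched literally.
ISO_UTC_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

parsed = datetime.strptime("2016-01-01T12:30:45.123Z", ISO_UTC_FORMAT)

# The literal Z is not interpreted as a time zone, so attach UTC explicitly.
parsed_utc = parsed.replace(tzinfo=timezone.utc)

print(parsed_utc.isoformat())  # 2016-01-01T12:30:45.123000+00:00
```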
The strftime() function converts date and time objects to their string representation. It takes a format string made up of one or more format codes and returns the resulting string. Syntax: strftime(format). Returns: the string representation of the date or time object.
This is a tricky one. My approach makes use of regular expressions and the (?(DEFINE)...) syntax, which is only supported by the newer regex module.
DEFINE lets us define subroutines prior to matching them, so first of all we define all the bricks needed for our date guessing function:

    bricks = re.compile(r"""
        (?(DEFINE)
            (?P<year_def>[12]\d{3})
            (?P<year_short_def>\d{2})
            (?P<month_def>January|February|March|April|May|June|
            July|August|September|October|November|December)
            (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
            (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
            (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
            (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
            (?P<hms_def>\d{2}:\d{2}:\d{2})
            (?P<hm_def>\d{2}:\d{2})
            (?P<ms_def>\d{5,6})
            (?P<delim_def>([-/., ]+|(?<=\d|^)T))
        )

        # actually match them
        (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
        (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
        """, re.VERBOSE)
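The classification idea behind the bricks can also be approximated with the standard library alone, by trying anchored patterns in a fixed order. This is a sketch of a swapped-in technique, not the DEFINE-based pattern above; the BRICKS list and the classify helper are hypothetical names.

```python
import re

# Ordered (label, pattern) pairs; the first pattern that matches the whole
# part wins, mirroring the order of the alternation above.
BRICKS = [
    ('hms', r'\d{2}:\d{2}:\d{2}'),
    ('year', r'[12]\d{3}'),
    ('month', r'January|February|March|April|May|June|'
              r'July|August|September|October|November|December'),
    ('month_short', r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'),
    ('day', r'0[1-9]|[1-9]|[12][0-9]|3[01]'),
    ('weekday', r'(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day'),
    ('weekday_short', r'Mon|Tue|Wed|Thu|Fri|Sat|Sun'),
    ('hm', r'\d{2}:\d{2}'),
    ('ms', r'\d{5,6}'),
]


def classify(part):
    """Return the label of the first brick matching the whole part, or None."""
    for label, pattern in BRICKS:
        if re.fullmatch(pattern, part):
            return label
    return None


print(classify('2006'))      # year
print(classify('Thursday'))  # weekday
print(classify('08:42:51'))  # hms
print(classify('xyz'))       # None
```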
After this, we need to think of possible delimiters:
    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')
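To see what this delimiter pattern does, here is a short sketch that splits a few date strings. This particular pattern also compiles with the standard-library re module, which is what the example uses; the helper name split_keep_delims is made up.

```python
import re  # the standard-library module is enough for this pattern

delim = re.compile(r'([-/., ]+|(?<=\d)T)')


def split_keep_delims(datestring):
    # The capturing group makes split() return the delimiters as well,
    # so the original string can be rebuilt from the parts.
    return delim.split(datestring)


print(split_keep_delims('2006-11-02'))
# ['2006', '-', '11', '-', '02']
print(split_keep_delims('Thu, 01 Jan 1970 00:00:00'))
# ['Thu', ', ', '01', ' ', 'Jan', ' ', '1970', ' ', '00:00:00']
print(split_keep_delims('2017-06-07T22:02:05.001811'))
# ['2017', '-', '06', '-', '07', 'T', '22:02:05', '.', '001811']
```

The lookbehind means a T only counts as a delimiter when it follows a digit, so the T in Thu or Tue is left alone.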
Format mapping:
    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
               'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
               'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
               'delim': ''}
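As a rough illustration of how this mapping is used, the sketch below turns a list of already labelled parts into a format string. The labelled list is written by hand here and build_format is a hypothetical helper; in the real function the labels come from the regex match.

```python
from datetime import datetime

formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
           'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
           'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
           'delim': ''}


def build_format(labelled_parts):
    """Join (label, text) pairs into a strptime/strftime format string."""
    # Delimiters are copied through unchanged; everything else is looked up.
    return "".join(text if label == 'delim' else formats[label]
                   for label, text in labelled_parts)


# Hand-labelled parts of '2006-11-02'.
labelled = [('year', '2006'), ('delim', '-'), ('month_dec', '11'),
            ('delim', '-'), ('day', '02')]

fmt = build_format(labelled)
print(fmt)                                   # %Y-%m-%d
print(datetime.strptime('2006-11-02', fmt))  # 2006-11-02 00:00:00
```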
The function GuessFormat() splits the parts with the help of the delimiters, tries to match them and outputs the corresponding code for strftime():
    import regex as re  # third-party module; needed for (?(DEFINE)...)


    def GuessFormat(datestring):
        # define the bricks
        bricks = re.compile(r"""
            (?(DEFINE)
                (?P<year_def>[12]\d{3})
                (?P<year_short_def>\d{2})
                (?P<month_def>January|February|March|April|May|June|
                July|August|September|October|November|December)
                (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
                (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                (?P<hm_def>T?\d{2}:\d{2})
                (?P<ms_def>\d{5,6})
                (?P<delim_def>([-/., ]+|(?<=\d|^)T))
            )

            # actually match them
            (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
            (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
            """, re.VERBOSE)

        # delim
        delim = re.compile(r'([-/., ]+|(?<=\d)T)')

        # formats
        formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m',
                   'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S',
                   'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M',
                   'delim': ''}

        parts = delim.split(datestring)
        out = []
        for index, part in enumerate(parts):
            try:
                # keep only the named groups that actually matched
                brick = dict(filter(lambda x: x[1] is not None,
                                    bricks.match(part).groupdict().items()))
                key = next(iter(brick))

                # ambiguities
                if key == 'day' and index == 2:
                    key = 'month_dec'

                item = part if key == 'delim' else formats[key]
                out.append(item)
            except AttributeError:
                # no brick matched: copy the part through unchanged
                out.append(part)

        return "".join(out)
A test at the end:

    import regex as re
    from datetime import datetime

    datestrings = [datetime.now().isoformat(),
                   '2006-11-02',
                   'Thursday, 10 August 2006 08:42:51',
                   'August 9, 1995',
                   'Aug 9, 1995',
                   'Thu, 01 Jan 1970 00:00:00',
                   '21/11/06 16:30',
                   '06 Jun 2017 20:33:10']

    # test
    for dt in datestrings:
        print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))
This yields:
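    Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
    Date: 2006-11-02, Format: %Y-%m-%d
    Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
    Date: August 9, 1995, Format: %B %m, %Y
    Date: Aug 9, 1995, Format: %b %m, %Y
    Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
    Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
    Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S

Note that not every guess is right. The index == 2 rule turns any day-like number in the third position into %m, so 10 August, August 9, Aug 9 and 01 Jan all get %m where %d is meant. The two-digit year in 21/11/06 is reported as %d, because year_short_def is defined but never used in the matching alternation.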
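Guesses that contain two month codes or a repeated directive, such as %B %m, %Y or %d/%m/%d in the output above, can be caught with a cheap sanity check before the format is reused for the rest of a file. The following is a sketch only: looks_consistent and CachedFormatParser are hypothetical names, and the guesser is passed in as a parameter so that any function returning a format string can be plugged in.

```python
import re
from datetime import datetime


def looks_consistent(fmt):
    """Reject formats with a repeated directive or more than one month code."""
    # Collect the character after each %, skipping the escaped form %%.
    codes = [c for c in re.findall(r'%(.)', fmt) if c != '%']
    month_codes = [c for c in codes if c in ('B', 'b', 'm')]
    return len(codes) == len(set(codes)) and len(month_codes) <= 1


class CachedFormatParser:
    """Guess the format from the first string, then reuse it afterwards."""

    def __init__(self, guess_format):
        self._guess_format = guess_format  # any callable: str -> format string
        self.fmt = None

    def parse(self, date_string):
        if self.fmt is None:
            candidate = self._guess_format(date_string)
            if not looks_consistent(candidate):
                raise ValueError(
                    "suspicious guessed format: {!r}".format(candidate))
            # strptime raises ValueError if the guess does not fit the string.
            parsed = datetime.strptime(date_string, candidate)
            self.fmt = candidate
            return parsed
        return datetime.strptime(date_string, self.fmt)


# Stand-in guesser for the demo; a real one would inspect the string.
parser = CachedFormatParser(lambda s: '%Y-%m-%d')
print(parser.parse('2006-11-02'))     # 2006-11-02 00:00:00
print(looks_consistent('%B %m, %Y'))  # False
```

One parser instance per file keeps the guessing cost to a single call, which is the saving the question is after. The check is deliberately coarse: it flags obviously contradictory formats but cannot tell whether a lone %d and %m have been swapped.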