I have a bunch of excel documents I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data:
The good thing is I know it is always Month/Day
10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010
I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?
Python has a built-in method to parse dates, strptime . This example takes the string “2020–01–01 14:00” and parses it to a datetime object. The documentation for strptime provides a great overview of all format-string options.
You just forgot to use %d in order to capture the date number and the : for the time and you ALSO need to capture +0000 . Show activity on this post.
Use datetime. strftime(format) to convert a datetime object into a string as per the corresponding format . The format codes are standard directives for mentioning in which format you want to represent datetime. For example, the %d-%m-%Y %H:%M:%S codes convert date to dd-mm-yyyy hh:mm:ss format.
The third-party module dateutil has a function parse
that operates similarly to PHP's strtotime
: you don't need to specify a particular date format, it just tries a bunch of its own.
>>> from dateutil.parser import parse
>>> parse("10/02/09", fuzzy=True)
datetime.datetime(2009, 10, 2, 0, 0) # default to be in American date format
It also allows you to specify different assumptions:
- dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).
- yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).
import re
ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''
regx = re.compile('[-/]')
for xd in ss.splitlines():
m,d,y = regx.split(xd)
print xd,' ','/'.join((m.zfill(2),d.zfill(2),'20'+y.zfill(2) if len(y)==2 else y))
result
10/02/09 10/02/2009
07/22/09 07/22/2009
09-08-2008 09/08/2008
9/9/2008 09/09/2008
11/4/2010 11/04/2010
03-07-2009 03/07/2009
09/01/2010 09/01/2010
And Edit 2 : taking account of the information on '{0:0>2}'.format(day)
from JBernardo, I added a 4th solution, that appears to be the fastest
import re
from time import clock
iterat = 100
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
' 03-07-2009', '09/01/2010']
reobj = re.compile(
r"""\s* # optional whitespace
(\d+) # Month
[-/] # separator
(\d+) # Day
[-/] # separator
(?:20)? # century (optional)
(\d+) # years (YY)
\s* # optional whitespace""",
re.VERBOSE)
te = clock()
for i in xrange(iterat):
ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
fdates1 = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
for date in ndates]
print "Tim's method ",clock()-te,'seconds'
regx = re.compile('[-/]')
te = clock()
for i in xrange(iterat):
ndates = (reobj.match(date).groups() for date in dates)
fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print "mixing solution",clock()-te,'seconds'
te = clock()
for i in xrange(iterat):
ndates = (regx.split(date.strip()) for date in dates)
fdates3 = ['/'.join((m.zfill(2),d.zfill(2),('20'+y.zfill(2) if len(y)==2 else y)))
for m,d,y in ndates]
print "eyquem's method",clock()-te,'seconds'
te = clock()
for i in xrange(iterat):
fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print "Tim + format ",clock()-te,'seconds'
print fdates1==fdates2==fdates3==fdates4
result
number of iteration's turns : 100
Tim's method 0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format 0.0153756971906 seconds
True
The mixing solution is interesting because it combines the speed of my solution and the ability of the regex of Tim Pietzcker to detect dates in a string.
That's still more true for the solution combining Tim's one and the formating with {:0>2}
. I cant' combine {:0>2}
with mine because regx.split(date.strip())
produces year with 2 OR 4 digits
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With