I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best thing I've found so far is dateutil.
Unfortunately, if a string has multiple date values in it, dateutil throws an error:
>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
raise ValueError, "unknown string format"
ValueError: unknown string format
Any thoughts on how to parse all dates from a long string? Ideally, a list would be created, but I can handle that myself if I need to.
I'm using Python, but at this point, other languages are probably OK, if they get the job done.
PS - I guess I could recursively split the input file in the middle and try, try again until it works, but it's a hell of a hack.
Python has a built-in method to parse dates, strptime . This example takes the string “2020–01–01 14:00” and parses it to a datetime object. The documentation for strptime provides a great overview of all format-string options.
The Date. parse() method parses a string representation of a date, and returns the number of milliseconds since January 1, 1970, 00:00:00 UTC or NaN if the string is unrecognized or, in some cases, contains illegal date values (e.g. 2015-02-31). Only the ISO 8601 format ( YYYY-MM-DDTHH:mm:ss.
Looking at it, the least hacky way would be to modify dateutil parser to have a fuzzy-multiple option.
parser._parse
takes your string, tokenizes it with _timelex
and then compares the tokens with data defined in parserinfo
.
Here, if a token doesn't match anything in parserinfo
, the parse will fail unless fuzzy
is True.
What I suggest you allow non-matches while you don't have any processed time tokens, then when you hit a non-match, process the parsed data at that point and start looking for time tokens again.
Shouldn't take too much effort.
Update
While you're waiting for your patch to get rolled in...
This is a little hacky, uses non-public functions in the library, but doesn't require modifying the library and is not trial-and-error. You might have false positives if you have any lone tokens that can be turned into floats. You might need to filter the results some more.
from dateutil.parser import _timelex, parser
a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
p = parser()
info = p.info
def timetoken(token):
try:
float(token)
return True
except ValueError:
pass
return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))
def timesplit(input_string):
batch = []
for token in _timelex(input_string):
if timetoken(token):
if info.jump(token):
continue
batch.append(token)
else:
if batch:
yield " ".join(batch)
batch = []
if batch:
yield " ".join(batch)
for item in timesplit(a):
print "Found:", item
print "Parsed:", p.parse(item)
Yields:
Found: 2011 04 23 Parsed: 2011-04-23 00:00:00 Found: 29 July 1928 Parsed: 1928-07-29 00:00:00
Update for Dieter
Dateutil 2.1 appears to be written for compatibility with python3 and uses a "compatability" library called six
. Something isn't right with it and it's not treating str
objects as text.
This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:
from cStringIO import StringIO
for item in timesplit(StringIO(a)):
print "Found:", item
print "Parsed:", p.parse(StringIO(item))
If you want to set option on the parserinfo, instantiate a parserinfo and pass it to the parser object. E.g:
from dateutil.parser import _timelex, parser, parserinfo
info = parserinfo(dayfirst=True)
p = parser(info)
While I was offline, I was bothered by the answer I posted here yesterday. Yes it did the job, but it was unnecessarily complicated and extremely inefficient.
Here's the back-of-the-envelope edition that should do a much better job!
import itertools
from dateutil import parser
jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
parser.parserinfo.UTCZONE,
parser.parserinfo.PERTAIN,
(x for s in parser.parserinfo.WEEKDAYS for x in s),
(x for s in parser.parserinfo.MONTHS for x in s),
(x for s in parser.parserinfo.HMS for x in s),
(x for s in parser.parserinfo.AMPM for x in s),
))
def parse_multiple(s):
def is_valid_kw(s):
try: # is it a number?
float(s)
return True
except ValueError:
return s.lower() in keywords
def _split(s):
kw_found = False
tokens = parser._timelex.split(s)
for i in xrange(len(tokens)):
if tokens[i] in jumpwords:
continue
if not kw_found and is_valid_kw(tokens[i]):
kw_found = True
start = i
elif kw_found and not is_valid_kw(tokens[i]):
kw_found = False
yield "".join(tokens[start:i])
# handle date at end of input str
if kw_found:
yield "".join(tokens[start:])
return [parser.parse(x) for x in _split(s)]
Example usage:
>>> parse_multiple("I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928")
[datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]
It's probably worth noting that its behaviour deviates slightly from dateutil.parser.parse
when dealing with empty/unknown strings. Dateutil will return the current day, while parse_multiple
returns an empty list which, IMHO, is what one would expect.
>>> from dateutil import parser
>>> parser.parse("")
datetime.datetime(2011, 8, 12, 0, 0)
>>> parse_multiple("")
[]
P.S. Just spotted MattH's updated answer which does something very similar.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With