I want to be able to read a string and return the first date that appears in it. Is there a ready-made module that I can use? I tried to write regexes for all possible date formats, but the result is quite long. Is there a better way to do it?
Use strftime() to display the time and date. The strftime() method returns a string representing a date and time, and is available on date, time, and datetime objects. Assuming you have created a new file named python_time.py, use this file to define the format of the information the system is to display.
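As a minimal sketch of the above, here is strftime() called on each of the three object types. The values and variable names are made up for illustration.

```python
from datetime import date, time, datetime

# Fixed example values so the output does not depend on the current clock
d = date(1999, 5, 12)
t = time(20, 58)
dt = datetime(2058, 7, 1, 9, 30)

# Each object type supports strftime() with the format codes that apply to it
date_text = d.strftime("%Y-%m-%d")
time_text = t.strftime("%H:%M")
stamp_text = dt.strftime("%Y-%m-%d %H:%M")

print(date_text)
print(time_text)
print(stamp_text)
```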
Extracting dates from text with the datefinder module. The Python datefinder module can locate dates in a body of text. Using its find_dates() function, it's possible to search text for many different types of dates. It yields each date it finds as a datetime object.
Use the date.today() method to get the current local date. date.today() returns a date object, which you can assign to a variable such as today. You can then call its strftime() method to create a string representing the date in different formats.
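A short sketch of that, using only the standard library. The two formats are arbitrary examples, and the month name produced by %B depends on the current locale.

```python
from datetime import date

today = date.today()  # current local date as a date object

# The same date rendered in two different formats
iso_text = today.strftime("%Y-%m-%d")   # year-month-day, zero padded
long_text = today.strftime("%d %B %Y")  # day, full month name, year

print(iso_text)
print(long_text)
```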
You can run a date parser on all subtexts of your text and pick the first date. Of course, such a solution would either catch things that are not dates or miss things that are, or most likely both.
Let me provide an example that uses dateutil.parser to catch anything that looks like a date:
import dateutil.parser
from itertools import chain
import re

# Add more strings that confuse the parser to this set
UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                          dateutil.parser.parserinfo.PERTAIN,
                          ['a']))


def _get_date(tokens):
    # Try the longest prefix first and shrink it until something parses
    for end in range(len(tokens), 0, -1):
        region = tokens[:end]
        if all(token.isspace() or token in UNINTERESTING
               for token in region):
            continue
        text = ''.join(region)
        try:
            date = dateutil.parser.parse(text)
            return end, date
        except (ValueError, OverflowError):
            pass
    # Falls through and returns None when no prefix parses


def find_dates(text, max_tokens=50, allow_overlapping=False):
    # Split into tokens, keeping the separators
    tokens = list(filter(None, re.split(r'(\S+|\W+)', text)))
    skip_dates_ending_before = 0
    for start in range(len(tokens)):
        region = tokens[start:start + max_tokens]
        result = _get_date(region)
        if result is not None:
            # Note: end is an index into region, i.e. relative to start
            end, date = result
            if allow_overlapping or end > skip_dates_ending_before:
                skip_dates_ending_before = end
                yield date


test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a
child during the Daleks' abduction and invasion of Earth in 2009.
On 1st July 2058, Bowie Base One became the first Human colony on Mars. It
was commanded by Captain Adelaide Brooke, and initially seemed to prove that
it was possible for Humans to live long term on Mars."""

print("With no overlapping:")
for date in find_dates(test, allow_overlapping=False):
    print(date)

print("With overlapping:")
for date in find_dates(test, allow_overlapping=True):
    print(date)
The result from the code is, quite unsurprisingly, rubbish whether you allow overlapping or not. If overlapping is allowed, you get a lot of dates that are nowhere to be seen in the text, and if it is not allowed, you miss the important date in the text.
With no overlapping:
1999-05-12 00:00:00
2009-07-01 20:58:00
With overlapping:
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-03 00:00:00
1999-05-03 00:00:00
1999-07-03 00:00:00
1999-07-03 00:00:00
2009-07-01 20:58:00
2009-07-01 20:58:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
Essentially, if overlapping is allowed:
If, however, overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00 and no attempt is made to parse the date after the period.
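If you know in advance which formats can occur, a narrower approach may avoid this kind of greedy misparse: match only those formats with a regex, validate each candidate with datetime.strptime, and keep the leftmost hit. The sketch below is an assumption of mine, not part of dateutil or any other library; PATTERNS and first_date are hypothetical names, and the two patterns are only examples that you would extend for your own data.

```python
import re
from datetime import datetime

# Hypothetical, deliberately short list: (pattern, template, strptime format).
# The template rebuilds a clean string from the captured groups, which drops
# ordinal suffixes such as "st" in "1st". %B depends on the current locale.
PATTERNS = [
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"), "{0}-{1}-{2}", "%Y-%m-%d"),
    (re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)? ([A-Z][a-z]+) (\d{4})\b"),
     "{0} {1} {2}", "%d %B %Y"),
]


def first_date(text):
    """Return the leftmost valid date in text as a datetime, or None."""
    best = None  # (start offset, parsed datetime)
    for pattern, template, fmt in PATTERNS:
        for match in pattern.finditer(text):
            try:
                parsed = datetime.strptime(template.format(*match.groups()), fmt)
            except ValueError:
                continue  # looked like a date but is not a real one
            if best is None or match.start() < best[0]:
                best = (match.start(), parsed)
            break  # later matches of this pattern start further right
    return None if best is None else best[1]


sample = "Invasion of Earth in 2009. On 1st July 2058, Bowie Base One opened."
print(first_date(sample))
```

The trade-off is the opposite of the parser-based version: a bare year such as "2009" is ignored, and so is anything written in a format you did not list.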