
Best way to identify and extract dates from text in Python?

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.

For example, I have a large list of strings, usually English sentences or statements, that come in a variety of forms:

Central design committee session Tuesday 10/22 6:30 pm

Th 9/19 LAB: Serial encoding (Section 2.2)

There will be another one on December 15th for those who are unable to make it today.

Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm

He will be flying in Sept. 15th.

While these dates are inline with natural text, none of them are in specifically natural language forms themselves (e.g., there's no "The meeting will be two weeks from tomorrow"; it's all explicit).

As someone who doesn't have too much experience with this kind of processing, what would be the best place to begin? I've looked into things like the dateutil.parser module and parsedatetime, but those seem to be for after you've isolated the date.

Because of this, is there any good way to extract both the date and the extraneous text, for example:

    input:  Th 9/19 LAB: Serial encoding (Section 2.2)
    output: ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']

or something similar? It seems like this sort of processing is done by applications like Gmail and Apple Mail, but is it possible to implement in Python?
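For explicit formats like the ones listed above, one possible starting point is a plain regular expression. The following is only a rough sketch, not a general solution: the pattern names and the `split_date` helper are hypothetical, and the pattern covers just the forms seen in the examples (an optional weekday, then either `M/D` or a month name with a day, then an optional clock time).

```python
import re

# Hypothetical patterns, tuned only to the example strings in this question.
WEEKDAY = (r"(?:Monday|Mon|Tuesday|Tues|Tue|Wednesday|Wed|Thursday|Thurs|Thur|Thu|Th"
           r"|Friday|Fri|Saturday|Sat|Sunday|Sun)\.?")
MONTH = (r"(?:January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul"
         r"|August|Aug|September|Sept|Sep|October|Oct|November|Nov|December|Dec)\.?")
NUMERIC = r"\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"          # 9/19 or 10/22/2013
NAMED = MONTH + r"\s+\d{1,2}(?:st|nd|rd|th)?\b"      # December 15th, Sept. 15th
TIME = r"\d{1,2}:\d{2}(?:\s*(?:am|pm)\b)?"           # 6:30 pm, 11:59pm

DATE_RE = re.compile(
    rf"(?:\b{WEEKDAY}\s+)?\b(?:{NUMERIC}|{NAMED})(?:\s+{TIME})?",
    re.IGNORECASE,
)


def split_date(text):
    """Return (date_text, remaining_text), or (None, text) if no date is found."""
    match = DATE_RE.search(text)
    if match is None:
        return None, text
    rest = text[:match.start()] + " " + text[match.end():]
    # Collapse the whitespace left behind where the date was cut out.
    return match.group(0), " ".join(rest.split())


print(split_date("Th 9/19 LAB: Serial encoding (Section 2.2)"))
# ('Th 9/19', 'LAB: Serial encoding (Section 2.2)')
```

The extracted date string could then be handed to something like dateutil.parser or parsedatetime. A hand-written pattern like this will miss any format it was not written for, which is the main argument for a dedicated library.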

redct asked Nov 15 '13 05:11




2 Answers

I was also looking for a solution to this and couldn't find any, so a friend and I built a tool to do this. I thought I would come back and share it in case others find it helpful.

datefinder -- find and extract dates inside text

Here's an example:

    import datefinder

    string_with_dates = '''
        Central design committee session Tuesday 10/22 6:30 pm
        Th 9/19 LAB: Serial encoding (Section 2.2)
        There will be another one on December 15th for those who are unable to make it today.
        Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
        He will be flying in Sept. 15th.
        We expect to deliver this between late 2021 and early 2022.
    '''

    matches = datefinder.find_dates(string_with_dates)
    for match in matches:
        print(match)
akoumjian answered Oct 06 '22 21:10


I am surprised that there is no mention of SUTime and dateparser's search_dates method.

    from sutime import SUTime
    import os
    import json
    from dateparser.search import search_dates

    str1 = "Let's meet sometime next Thursday"

    # You'll get more information about these jar files from SUTime's github page
    jar_files = os.path.join(os.path.dirname(__file__), 'jars')
    sutime = SUTime(jars=jar_files, mark_time_ranges=True)

    print(json.dumps(sutime.parse(str1), sort_keys=True, indent=4))
    """output:
    [
        {
            "end": 33,
            "start": 20,
            "text": "next Thursday",
            "type": "DATE",
            "value": "2018-10-11"
        }
    ]
    """

    print(search_dates(str1))
    #output:
    #[('Thursday', datetime.datetime(2018, 9, 27, 0, 0))]

Although I have tried other modules like dateutil, datefinder and natty (I couldn't get duckling to work with Python), these two seem to give the most promising results.

The results from SUTime are more reliable, as the code snippet above shows. However, SUTime fails in some basic scenarios, such as parsing text like

"I won't be available until 9/19"

or

"I won't be available between (September 18-September 20).

It gives no result for the first text and only the month and year for the second. The search_dates method, however, handles both quite well. It is more aggressive and returns every possible date related to any word in the input text.

I haven't yet found a way to parse the text strictly for dates with search_dates. If I find one, it will be my first choice over SUTime, and I will update this answer.
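In the meantime, one possible workaround is to post-filter whatever search_dates returns. This is a rough sketch under assumptions: `keep_explicit` is a hypothetical helper, the "contains a digit" rule is only one crude notion of strictness, and the sample tuples are made up for illustration (shaped like the `(matched_text, datetime)` pairs in the output above) rather than real search_dates results.

```python
import re
from datetime import datetime


def keep_explicit(matches):
    """Keep only (text, datetime) pairs whose matched text contains a digit.

    `matches` may be None, which is treated here as "no matches".
    """
    return [(text, when) for text, when in (matches or []) if re.search(r"\d", text)]


# Made-up sample in the same shape as the search_dates output shown above.
sample = [
    ("9/19", datetime(2018, 9, 19)),
    ("Thursday", datetime(2018, 9, 27)),
]
print(keep_explicit(sample))
```

Note that this rule also throws away legitimate relative matches such as "Thursday", so it only suits cases where every date of interest is written out with numbers.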

Afsan Abdulali Gujarati answered Oct 06 '22 22:10