
Heuristic (fuzzy) date extraction from a string?

I need to heuristically parse a string of text that contains a date in a rather arbitrary (unknown) format.

function parseDateStr($text) {
    $cleanText = filter($text);
    # ...
    $day = findDay($cleanText);
    $month = findMonth($cleanText);
    $year = findYear($cleanText);
    # .. assert constraints, parse again or fail
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}

The input text is an English sentence plus arbitrary punctuation symbols (roughly a subset of the \W regexp class). The task of the algorithm is to extract only the date, after filtering away any garbage (noise) words unrelated to it. The algorithm is allowed to fail and return no result. If only a pair of adjacent digits (MM) and a group of four digits (YYYY) are found in the string, the two digits are assumed to be the month, and the day is taken to be 01 (the first day of the month). The result is a date in "YYYY-MM-DD" (SQL) format (of type DATE).

My idea is to design a series of filters using preg_replace & co., then apply logical constraints on the ranges of $year and $day, use a vocabulary for $month, etc. I would not be surprised if similar but more elegant solutions or approaches are conceivable or already exist. If so, please let me know about them. I would also appreciate any criticism or pointers to potential pitfalls.
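For illustration, the filter-and-constrain idea could be sketched roughly as below. This is only a minimal sketch: the function name guessDate, the accepted year range (1500 to 2099), and the "month before day" order for purely numeric input are all assumptions, not part of the question.

```php
<?php
// Hypothetical helper; name, year range and number order are assumptions.
function guessDate(string $text): ?string
{
    $months = ['jan' => 1, 'feb' => 2, 'mar' => 3, 'apr' => 4, 'may' => 5, 'jun' => 6,
               'jul' => 7, 'aug' => 8, 'sep' => 9, 'oct' => 10, 'nov' => 11, 'dec' => 12];

    // Year: first standalone four-digit group inside the assumed range.
    if (!preg_match('/(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)/', $text, $y, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $year = (int) $y[1][0];
    // Blank out the year so its digits are not reused as month or day.
    $rest = substr_replace($text, '    ', $y[1][1], 4);

    $month = null;
    $day = 1; // default when no day is found
    $monthRe = '/\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
             . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b/i';
    $numRe = '/(?<!\d)(\d{1,2})(?!\d)/';

    if (preg_match($monthRe, $rest, $mm)) {
        // Month from the vocabulary; the first short number, if any, is the day.
        $month = $months[strtolower(substr($mm[1], 0, 3))];
        if (preg_match($numRe, $rest, $dd)) {
            $day = (int) $dd[1];
        }
    } elseif (preg_match_all($numRe, $rest, $nn)) {
        // Numeric only: first number is the month, second (if present) the day.
        $month = (int) $nn[1][0];
        $day = isset($nn[1][1]) ? (int) $nn[1][1] : 1;
    }

    // Range constraint: reject impossible dates instead of guessing further.
    if ($month === null || !checkdate($month, $day, $year)) {
        return null;
    }
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}
```

One pitfall is visible already: the word "may" is both a month and a common verb, so a vocabulary match alone can misfire on ordinary sentences.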

Relation to similar questions:

Please note that this question is different from more basic date parsing questions such as:

  • PHP Parse Date String
  • How to parse any date format

since in my case I cannot specify or determine the format of the string. On the other hand, the following questions deal with similar tasks:

  • Extracting date from a string in Python
  • Extract multiple date format from few string variables in php
  • Extracting date from a string in PHP

I am not sure if the last one is a duplicate; it is not entirely clear to me what the OP wants to parse (although checkdate and date_parse seem to be partially useful). But the point made in the first question about the whole "monkey business" also holds for my case, and it has been addressed there by fuzzy parsing, as in

dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)

Finally, the second one contains a great regexp for grabbing dates (almost "fuzzy").

PS: by elegant I mean that the code is rather compact (there are no significant constraints on performance, so using "hacky" regexps is OK).

Yauhen Yakimovich asked Mar 11 '13 23:03


1 Answer

timelib

Well, date_parse performs very well, and it was educational to learn why. The PHP function date_parse is part of ext/date/lib, also known as timelib. Despite the lack of proper documentation, its C implementation (written by Derick Rethans and exposed to PHP through the Zend Engine macros and declarations) apparently makes it a clever tool:

  1. date_parse is already fuzzy: the documentation page carries a lot of warnings (and complaints) that the function tolerates and parses too much, but this is clearly a feature and not a bug (otherwise one should use date_parse_from_format or the corresponding DateTime::createFromFormat()).
  2. date_parse uses (a lot of) regular expressions in a relatively smart way (based on re2c).
  3. In addition to filtering, this "scanner" looks for all possible combinations of words and date formats (from the list of known months and timezones) and, finally, just makes a "blind" guess by looking for YYYY, MM and DD "separately" (very similar to what I need to do).
  4. date_parse is a true compiled "scanner" that comes with look-ahead logic and error reporting that the user can handle further (no exceptions, just messages inside the nested array of results).
  5. There is even a Python package wrapping the C code of timelib (so I am not even sure which is ultimately better at "parsing the monkey business": timelib or python-dateutil).
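The error reporting mentioned in point 4 can be inspected directly. The following is a small sketch; parseProblems is a hypothetical helper name, while the array keys ('errors', 'warnings', 'error_count', 'year', and so on) are the ones date_parse documents.

```php
<?php
// Hypothetical helper: collect every message date_parse() reported for a string.
function parseProblems(string $text): array
{
    $parsed = date_parse($text);
    // 'errors' and 'warnings' are arrays keyed by character position in the input;
    // the positions are dropped here and only the messages are kept.
    return array_merge(array_values($parsed['errors']), array_values($parsed['warnings']));
}

// A clean ISO date: all three fields are integers and nothing is reported.
$parsed = date_parse('2010-07-10');
printf("%d-%d-%d with %d error(s)\n",
    $parsed['year'], $parsed['month'], $parsed['day'], $parsed['error_count']);

// Unparseable input does not throw: fields that could not be
// determined are false, and the messages land in the arrays.
var_dump(date_parse('!')['year']);
print_r(parseProblems('!'));
```

Because nothing is thrown, the caller decides how strict to be: ignore the messages for fuzzy extraction, or treat any error as a failure.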

testing and examples

For my part, I failed to find any input from my dataset that date_parse could not parse, e.g.:

echo FuzzyDateParser::fromText('banana 1/2/3');
echo FuzzyDateParser::fromText('Joe Soap was born on 12 February 1981');
echo FuzzyDateParser::fromText('2005 Feb., reprint');
echo FuzzyDateParser::fromText('!'); # will fail to parse, producing an empty string.
echo FuzzyDateParser::fromText('monkey 2010-07-10 loves bananas and php');

The code for the FuzzyDateParser class can be found in this gist. It can be useful as a template for handling errors and implementing a fallback from date_parse results to your own custom logic (which I eventually did not need in my case).
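The gist itself is not reproduced here, so the following is only a hypothetical stand-in showing what such a template might look like. The $fallback parameter and the choice to ignore date_parse messages are assumptions of this sketch, not a description of the original class.

```php
<?php
// Hypothetical stand-in, NOT the class from the gist.
final class FuzzyDateParser
{
    /**
     * Returns 'YYYY-MM-DD', or an empty string when nothing usable is found.
     * $fallback is an assumed hook of the form fn(string): ?string.
     */
    public static function fromText(string $text, ?callable $fallback = null): string
    {
        $p = date_parse($text);

        // Errors and warnings are deliberately ignored: for fuzzy extraction
        // only the recovered fields matter. Missing fields are false.
        if ($p['year'] !== false && $p['month'] !== false) {
            $day = $p['day'] === false ? 1 : $p['day']; // default to the 1st
            if (checkdate($p['month'], $day, $p['year'])) {
                return sprintf('%04d-%02d-%02d', $p['year'], $p['month'], $day);
            }
        }

        // date_parse gave nothing usable: hand over to custom logic, if any.
        $result = $fallback !== null ? $fallback($text) : null;
        return $result ?? '';
    }
}

echo FuzzyDateParser::fromText('2010-07-10'), "\n";
```

Passing the fallback as a callable keeps custom regexp logic out of the class, so it can be added later without touching the date_parse path.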

Yauhen Yakimovich answered Sep 24 '22 18:09
