I need to heuristically parse a string of text that contains a date in an arbitrary (unknown) format.
function parseDateStr($text) {
$cleanText = filter($text);
# ...
$day = findDay($cleanText);
$month = findMonth($cleanText);
$year = findYear($cleanText);
# .. assert constraints, parse again or fail
return sprintf('%04d-%02d-%02d', $year, $month, $day);
}
The input text is an English sentence plus arbitrary punctuation symbols (roughly a subset of the \W regexp class). The algorithm should extract only the date, after filtering out any garbage (noise) words unrelated to it. The algorithm is allowed to fail and return no result. If only one pair of adjacent digits (MM) and one group of four digits (YYYY) are found in the string, the two digits are assumed to be the month and the day is taken to be 01 (the first day of the month). The result is a date in "YYYY-MM-DD" format (SQL type DATE).
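The month-and-year-only rule above can be sketched roughly as follows. This is only an illustration of the stated rule, not a full solution: the function name monthYearToDate is hypothetical, and it assumes that "found" means exactly one standalone two-digit group and exactly one standalone four-digit group in the text.

```php
<?php
// Hypothetical helper: applies the "MM + YYYY only" rule from the question.
// Returns "YYYY-MM-01" or null when the rule does not apply.
function monthYearToDate(string $text): ?string
{
    // Exactly one standalone 4-digit group (not part of a longer digit run).
    if (preg_match_all('/(?<!\d)\d{4}(?!\d)/', $text, $years) !== 1) {
        return null;
    }
    // Exactly one standalone 2-digit group.
    if (preg_match_all('/(?<!\d)\d{2}(?!\d)/', $text, $months) !== 1) {
        return null;
    }

    $year  = (int) $years[0][0];
    $month = (int) $months[0][0];

    // Logical constraint: the two digits must be a valid month number.
    if ($month < 1 || $month > 12) {
        return null;
    }

    // The day defaults to the first of the month.
    return sprintf('%04d-%02d-01', $year, $month);
}
```

A string such as 'issue 07/2010' satisfies the rule, while '2010-07-10' does not, because it contains two standalone two-digit groups and therefore needs the full day/month/year logic.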
My idea is to design a series of filters using preg_replace & co., then apply logical constraints on the range of $year and $day, use a vocabulary for $month, and so on. I would not be surprised if similar but more elegant solutions or approaches are conceivable or already exist. If so, please let me know about them. I would also appreciate any criticism or pointers to potential pitfalls.
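A minimal sketch of the vocabulary and range-constraint idea might look like this. The helper names findMonthByName and findYearInRange are hypothetical, and the default year range of 1000 to 2100 is an arbitrary assumption that would need tuning for a real dataset.

```php
<?php
// Hypothetical helper: looks up an English month name or its 3-letter
// abbreviation and returns the month number, or null if none is found.
function findMonthByName(string $text): ?int
{
    $numbers = [
        'jan' => 1, 'feb' => 2, 'mar' => 3, 'apr' => 4,
        'may' => 5, 'jun' => 6, 'jul' => 7, 'aug' => 8,
        'sep' => 9, 'oct' => 10, 'nov' => 11, 'dec' => 12,
    ];
    // Full names are spelled out so that words like "marble" do not match.
    $pattern = '/\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|'
             . 'jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:tember)?|'
             . 'oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b/i';

    if (preg_match($pattern, $text, $m) === 1) {
        return $numbers[strtolower(substr($m[1], 0, 3))];
    }
    return null;
}

// Hypothetical helper: returns the first standalone 4-digit number that
// falls inside the assumed plausible range, or null.
function findYearInRange(string $text, int $min = 1000, int $max = 2100): ?int
{
    preg_match_all('/\b\d{4}\b/', $text, $matches);
    foreach ($matches[0] as $candidate) {
        $year = (int) $candidate;
        if ($year >= $min && $year <= $max) {
            return $year;
        }
    }
    return null;
}
```

One pitfall of the vocabulary approach is visible already: "may" is also an ordinary English verb, so a sentence like "this may be from 2005" would be read as May 2005 unless further context checks are added.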
Relation to similar questions:
Please note that this question differs from more basic date-parsing questions such as:
since in my case I cannot specify or determine the format of the string. On the other hand, the following questions discuss similar tasks:
I am not sure whether the last one is a duplicate; it is not entirely clear to me what the OP wants to parse (although checkdate and date_parse seem to be partially useful). The first question, about the whole "monkey business", also applies to my case and has been addressed there by fuzzy parsing, as in
dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)
Finally, the second one contains a great date-grabbing regexp (almost "fuzzy").
PS: by "elegant" I mean that the code is rather compact (there are no significant performance constraints, so using "hacky" regexps is OK).
Well, date_parse performs very well, and it was educational to learn why. The PHP function date_parse is part of ext/date/lib (timelib), and although it lacks proper documentation, its implementation in C (written by Derick Rethans and called from the Zend Engine macros part with declarations) makes it a clever tool:
For my part, I have failed to find any input in my dataset that date_parse could not parse, e.g.:
echo FuzzyDateParser::fromText('banana 1/2/3');
echo FuzzyDateParser::fromText('Joe Soap was born on 12 February 1981');
echo FuzzyDateParser::fromText('2005 Feb., reprint');
echo FuzzyDateParser::fromText('!'); # will fail to parse, producing an empty string.
echo FuzzyDateParser::fromText('monkey 2010-07-10 loves bananas and php');
The code for the FuzzyDateParser class can be found in this gist. It can serve as a template for handling errors and implementing a fallback from date_parse results to your own custom logic (which I eventually did not need for my case).
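For readers without access to the gist, here is a rough sketch of what such a wrapper around date_parse could look like. It is not the FuzzyDateParser class itself: the function name fuzzyDateOrEmpty is hypothetical, and the choices to ignore date_parse's reported errors, to default a missing day to 01, and to return an empty string on failure are assumptions based on the description above.

```php
<?php
// Hypothetical wrapper: returns "YYYY-MM-DD", or '' when no usable date is found.
function fuzzyDateOrEmpty(string $text): string
{
    $parts = date_parse($text);

    // date_parse() may report errors for unknown words while still filling in
    // the fields it recognised, so only the fields themselves are checked here.
    if (!is_array($parts) || $parts['year'] === false || $parts['month'] === false) {
        // A fallback to custom regexp-based logic could be called here instead.
        return '';
    }

    // Assumption: a missing day defaults to the first of the month.
    $day = $parts['day'] === false ? 1 : $parts['day'];

    // Reject impossible dates such as a 31st of February.
    if (!checkdate($parts['month'], $day, $parts['year'])) {
        return '';
    }

    return sprintf('%04d-%02d-%02d', $parts['year'], $parts['month'], $day);
}
```

The empty-string branch is the natural place to plug in regexp-based heuristics if date_parse turns out not to cover some inputs in a given dataset.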