I have to migrate a very large dataset from one system to another. One of the "source" columns contains a date, but it is really a string with no constraint, while the destination system mandates a date in the format yyyy-mm-dd.
Many, but not all, of the source dates are formatted as yyyymmdd. So to coerce them to the expected format, I do (in Perl):
return "$1-$2-$3" if ($val =~ /(\d{4})[-\/]*(\d{2})[-\/]*(\d{2})/);
The problem arises when the source dates move away from the "generic" yyyymmdd. The goal is to salvage as many dates as possible before giving up. Example source strings include:
21/3/1998, March 2004, 2001, 3/4/97
I can try to match as many of the examples as I can find with a succession of regular expressions such as the one above.
But is there something smarter to do? Am I reinventing the wheel? Is there a library somewhere that does something similar? I couldn't find anything relevant by googling "forgiving date parser". (Any language is OK.)
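For comparison, the succession-of-regexes fallback described in the question might look roughly like the sketch below. The helper name coerce_date and the choices it makes (day-first for slash dates, first day of the month or year when parts are missing, giving up on two-digit years) are assumptions for illustration, not something the question specifies. It uses only core Perl and does not check that the month and day are in range.

```perl
use strict;
use warnings;

# Month lookup keyed on the first three letters of the English name.
my %MONTH = (
    jan => 1, feb => 2,  mar => 3,  apr => 4,
    may => 5, jun => 6,  jul => 7,  aug => 8,
    sep => 9, oct => 10, nov => 11, dec => 12,
);

# Hypothetical helper: returns yyyy-mm-dd, or undef when no pattern matches.
sub coerce_date {
    my ($val) = @_;
    return undef unless defined $val;
    $val =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

    # yyyymmdd, yyyy-mm-dd or yyyy/mm/dd
    return "$1-$2-$3"
        if $val =~ /^(\d{4})[-\/]?(\d{2})[-\/]?(\d{2})$/;

    # d/m/yyyy (assumption: day comes first)
    return sprintf("%04d-%02d-%02d", $3, $2, $1)
        if $val =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$};

    # "March 2004": assume the first day of the month
    if ($val =~ /^([A-Za-z]{3})[A-Za-z]*\s+(\d{4})$/ && $MONTH{lc $1}) {
        return sprintf("%04d-%02d-01", $2, $MONTH{lc $1});
    }

    # Bare year: assume January 1st
    return "$1-01-01" if $val =~ /^(\d{4})$/;

    # Anything else (including two-digit years such as 3/4/97) is left alone.
    return undef;
}
```

Each new source format needs another branch here, which is the maintenance cost the question is worried about.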
Date::Manip is your friend. By default it fails on only one of your four examples, because it assumes US format; with Date_Init you can get 4 out of 4.
If you have mixed formats (i.e., month before day in some rows and day before month in others), you'd have to parse them differently: once with the US date format and once with a non-US date format. This matters most when the string is ambiguous, like your 3/4/97 example. A string like 21/3 simply fails under the wrong format, so you can tell the format is wrong.
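As a rough illustration of that point, slash-separated dates can be sorted up front into those that can only be day-first, those that can only be month-first, and those that are genuinely ambiguous. The helper name slash_date_order and its return values are hypothetical, and the sketch uses only core Perl rather than Date::Manip.

```perl
use strict;
use warnings;

# Hypothetical helper: classify an "a/b/year" string as
# 'day-first', 'month-first', 'ambiguous' or 'unknown'.
sub slash_date_order {
    my ($val) = @_;
    return 'unknown'
        unless defined $val
        && $val =~ m{^\s*(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})\s*$};
    my ($first, $second) = ($1, $2);

    # Neither field can be a day or month outside 1..31.
    return 'unknown'
        if $first < 1 || $second < 1 || $first > 31 || $second > 31;

    # Both fields above 12: no valid month in either position.
    return 'unknown' if $first > 12 && $second > 12;

    return 'day-first'   if $first > 12;     # e.g. 21/3/1998
    return 'month-first' if $second > 12;    # e.g. 3/21/1998
    return 'ambiguous';                      # e.g. 3/4/97
}
```

Rows classified as ambiguous cannot be resolved from the string alone; they need a decision about which convention the source system used.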
vinko@mithril:~$ more date.pl
use strict;
use warnings;
use Date::Manip;
my @a;
push @a, "March 2004";
push @a, "2001";
push @a, "3/4/97";
push @a, "21/3/1998";
Date_Init("DateFormat=non-US");
for my $d (@a) {
    print "$d\n";
    print ParseDate($d)."\n";
}
vinko@mithril:~$ perl date.pl
March 2004
2004030100:00:00
2001
2001010100:00:00
3/4/97
1997040300:00:00
21/3/1998
1998032100:00:00
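The strings printed above are in the form yyyymmddhh:mn:ss, not the yyyy-mm-dd the destination expects. Date::Manip's UnixDate function with a "%Y-%m-%d" format should be able to produce that directly. As a dependency-free alternative, here is a small sketch that reformats the string by hand. The helper name to_iso_date is hypothetical, and it assumes the output shape shown in the transcript.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a "yyyymmddhh:mn:ss" string (the shape
# printed in the transcript above) to yyyy-mm-dd.
# Returns undef for anything else, including an empty or undefined
# value from a failed parse.
sub to_iso_date {
    my ($parsed) = @_;
    return undef unless defined $parsed;
    return "$1-$2-$3"
        if $parsed =~ /^(\d{4})(\d{2})(\d{2})\d{2}:\d{2}:\d{2}$/;
    return undef;
}
```

Rows for which this returns undef are the ones that still need manual attention.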