
Extract inconsistently formatted date from string (date parsing, NLP)

Tags: date, perl, nlp

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010".

In short, the dates are normally incomplete, sometimes not there, are inconsistently formatted and are embedded in a string with other information, e.g. "Report Aug06.xls".

Are there any Perl modules available which will do a decent job of guessing the date from such a string? It doesn't have to be 100% correct, as it will be verified by a human manually, but I'm trying to make things as easy as possible for that person and there are thousands of entries to check :)

El Yobo asked Aug 10 '10 01:08


1 Answer

Date::Parse is definitely going to be part of your answer: it's the bit that takes a randomly formatted date-like string and makes an actual usable date out of it.

The other part of your problem - the rest of the characters in your filenames - is unusual enough that you're unlikely to find someone else has packaged up a module for you.

Without seeing more of your sample data, it's really only possible to guess, but I'd start by identifying possible or likely "date section" candidates.

Here's a nasty brute-force example using Date::Parse. A smarter approach would use a list of regexes to try to identify the date-like bits, but I'm happy to burn CPU cycles to not think quite so hard!

#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

my @files=("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls", 
           "Annual Report-08-06", "End-of-month Report01-08-06.xls", "Report2006");

# assumption - the longest likely date string is something like '11th September 2006' (19 chars);
# the shortest is "2006" (4 chars).
# brute force every suffix from 19 down to 4 chars long at the end of the filename (less extension)
# and report the longest one that Date::Parse recognises as a date

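foreach my $file (@files) {
  # chop the last extension if there is one (only the final ".xxx", not everything after the first dot)
  $file =~ s/\.[^.]*$//;
  # try the longest suffix first, then progressively shorter ones
  for my $len (-19..-4) {
    my $string = substr($file, $len);
    my $time = str2time($string);
    # str2time returns undef when it cannot parse the string
    next unless defined $time;
    print "$string is a date: $time = ", scalar(localtime($time)), "\n";
    last;
  }
}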
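
For comparison, here is a rough sketch of the "list of regexes" idea mentioned above. It uses core Perl only, no Date::Parse. Everything in it is an assumption made for illustration: the helper names (guess_date, format_guess, full_year, month_num), the choice to read "01-08-06" as day-month-year, and the rule that two-digit years below 70 mean 20xx. Patterns are tried from most to least specific, and a digit run such as "20202010" is rejected because a year must not touch other digits. Undelimited forms like "011004" are deliberately not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Full or abbreviated English month names (assumed; extend for other languages).
my $MON = qr/(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?
             |aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)/ix;
my $SEP  = qr/[ _-]*/;                  # optional separator between date parts
my $YEAR = qr/(\d{4}|\d{2})(?!\d)/;     # 4- or 2-digit year not followed by a digit

my %MONTH = (jan => 1, feb => 2, mar => 3, apr => 4,  may => 5,  jun => 6,
             jul => 7, aug => 8, sep => 9, oct => 10, nov => 11, dec => 12);

# Look a month name up by its first three letters.
sub month_num { return $MONTH{ lc substr($_[0], 0, 3) } }

# ASSUMPTION: two-digit years below 70 are 20xx, the rest 19xx.
sub full_year {
    my ($y) = @_;
    return $y + 0 if length($y) == 4;
    return $y < 70 ? 2000 + $y : 1900 + $y;
}

# Each entry is [regex, handler]; the handler maps the regex captures to
# (year, month, day), with undef for any part that is missing.
my @patterns = (
    # "11th September 2006"
    [ qr/(?<!\d)(\d{1,2})(?:st|nd|rd|th)?$SEP$MON$SEP$YEAR/i,
      sub { my ($d, $m, $y) = @_; (full_year($y), month_num($m), $d + 0) } ],
    # "Aug06", "Aug2006", "August 2006"
    [ qr/$MON$SEP$YEAR/,
      sub { my ($m, $y) = @_; (full_year($y), month_num($m), undef) } ],
    # "01-08-06" (ASSUMED to be day-month-year)
    [ qr/(?<!\d)(\d{1,2})[-.](\d{1,2})[-.]$YEAR/,
      sub { my ($d, $m, $y) = @_; (full_year($y), $m + 0, $d + 0) } ],
    # "08-06" (month-year)
    [ qr/(?<!\d)(\d{1,2})-$YEAR/,
      sub { my ($m, $y) = @_; (full_year($y), $m + 0, undef) } ],
    # "2006" (bare year, limited to 19xx and 20xx)
    [ qr/(?<!\d)((?:19|20)\d{2})(?!\d)/,
      sub { ($_[0] + 0, undef, undef) } ],
);

# Return (year, month, day) from the first pattern that matches and yields
# an in-range month and day, or an empty list if nothing looks like a date.
sub guess_date {
    my ($name) = @_;
    for my $p (@patterns) {
        my ($re, $handler) = @$p;
        my @m = $name =~ $re;
        next unless @m;
        my ($y, $mon, $d) = $handler->(@m);
        next if defined $mon && ($mon < 1 || $mon > 12);
        next if defined $d   && ($d   < 1 || $d   > 31);
        return ($y, $mon, $d);
    }
    return;
}

# Render a guess as YYYY, YYYY-MM or YYYY-MM-DD for the human checker.
sub format_guess {
    my ($y, $mon, $d) = @_;
    return 'no date found' unless defined $y;
    return sprintf('%04d', $y) unless defined $mon;
    return sprintf('%04d-%02d', $y, $mon) unless defined $d;
    return sprintf('%04d-%02d-%02d', $y, $mon, $d);
}

for my $file ("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls",
              "Annual Report-08-06", "End-of-month Report01-08-06.xls",
              "Report2006", "20202010") {
    print "$file => ", format_guess(guess_date($file)), "\n";
}
```

Because the patterns sit in an ordered list, a new format found during the manual check can be handled by adding one entry. Leaving missing parts as undef, rather than defaulting them to the 1st of the month, keeps the guess honest for whoever verifies it.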
bigiain answered Oct 13 '22 22:10
