Extract dates, times and date ranges from text in PHP

Tags: date, regex, php

I'm building a local events calendar which takes RSS feeds and website scrapes and extracts event dates from them.

I previously asked how to extract dates from text in PHP, and received a good answer at the time from MarcDefiant:

function parse_date_tokens($tokens) {
  # only try to extract a date if we have 2 or more tokens
  if(!is_array($tokens) || count($tokens) < 2) return false;
  return strtotime(implode(" ", $tokens));
}

function extract_dates($text) {
  static $patterns = Array(
    '/^[0-9]+(st|nd|rd|th)?$/i', # day
    '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month ("etc" stands in for the remaining months)
    '/^20[0-9]{2}$/', # year
    '/^of$/' #words
  );
  # defines which of the above patterns aren't actually part of a date
  static $drop_patterns = Array(
    false,
    false,
    false,
    true
  );
  $tokens = Array();
  $result = Array();
  $text = str_word_count($text, 1, '0123456789'); # get all words in text

  # iterate words and search for matching patterns
  foreach($text as $word) {
    $found = false;
    foreach($patterns as $key => $pattern) {
      if(preg_match($pattern, $word)) {
        if(!$drop_patterns[$key]) {
          $tokens[] = $word;
        }
        $found = true;
        break;
      }
    }

    if(!$found) {
      $result[] = parse_date_tokens($tokens);
      $tokens = Array();
    }
  }
  $result[] = parse_date_tokens($tokens);

  return array_filter($result);
}

# test
$texts = Array(
  "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
  "Valentines Special @ The Radisson, Feb 14th",
  "On Friday the 15th of February, a special Hollywood themed [...]",
  "Symposium on Childhood Play on Friday, February 8th",
  "Hosting a craft workshop March 9th - 11th in the old [...]"
);

$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
  echo "  " . date('d.m.Y H:i:s', $date) . "\n";
}

However, the solution has some downsides - for one thing, it can't match date ranges.

I'm now looking for a more complex solution that can extract dates, times and date ranges from sample text.

What's the best approach for this? I seem to be leaning back toward a series of regex statements run one after the other to catch these cases. I can't see a better way of catching date ranges in particular, but I know there must be one. Are there any libraries out there just for date parsing in PHP?

Date / Date Range samples, as requested

$dates = [
    " Saturday 28th December",
    "2013/2014",
    "Friday 10th of January",
    "Thursday 19th December",
    " on Sunday the 15th December at 1 p.m",
    "On Saturday December 14th ",
    "On Saturday December 21st at 7.30pm",
    "Saturday, March 21st, 9.30 a.m.",
    "Jan-April 2014",
    "January 21st - Jan 24th 2014",
    "Dec 30th - Jan 3rd, 2014",
    "February 14th-16th, 2014",
    "Mon 14 - Wed 16 April, 12 - 2pm",
    "Sun 13 April, 8pm",
    "Mon 21 - Wed 23 April",
    "Friday 25 April, 10 – 3pm",            
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
];

The function I'm currently using (not the one above) is about 90% accurate. It can catch date ranges, but has difficulty when a time is also specified. It relies on a list of regular expressions and is very convoluted.

UPDATE: Jan 6th, 2014

I'm writing code that does this, based on my original method of running a series of regex statements one after the other. I think I'm close to a working solution that can extract almost any date/time range or format from a piece of text. When I'm done I'll post it here as an answer.

asked Dec 30 '13 at 10:12 by roryok


1 Answer

I think the regexes in your question can be condensed into a single one like this:

(?<date_format_1>(?<day>(?i)\b\s*[0-9]+(?:st|nd|rd|th|)?)(?<month>(?i)\b\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|etc))(?<year>\b\s*20[0-9]{2}) ) |
(?<date_format_2>(?&month)(?&day)(?!\s+-)) |
(?<date_format_3>(?&day)\s+of\s+(?&month)) |
(?<range_type_1>(?&month)(?&day)\s+-\s+(?&day))

Flags: x

Demo

http://regex101.com/r/wP5fR4

Discussion

By reusing named subpatterns through the (?&day) and (?&month) calls, you reduce the complexity of the final regex. I used a negative lookahead in date_format_2 because it would otherwise partially match range_type_1. You may need to add more range types depending on your data. Don't forget to check the other patterns in case of a partial match.
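To make the two ideas concrete, here is a minimal PHP sketch of named subpattern reuse plus the negative lookahead. It is not the regex from this answer: it swaps in a (?(DEFINE)...) block to declare day and month up front, lists only three months, and adds a trailing \b to day so that "9th" cannot be shortened to "9" by backtracking. The group names month_day and range and the function classify_dates are hypothetical.

```php
<?php
// Sketch only: the (?(DEFINE)...) block and the names "month_day", "range"
// and classify_dates() are assumptions for illustration.
// Only three months are listed to keep the example short.
const DATE_REGEX = '/
    (?(DEFINE)
        (?<day>   [0-9]{1,2} (?:st|nd|rd|th)? \b )
        (?<month> Jan(?:uary)? | Feb(?:ruary)? | Mar(?:ch)? )
    )
    \b
    (?:
        (?<month_day> (?&month) \s+ (?&day) (?! \s+ - ) )
      | (?<range>     (?&month) \s+ (?&day) \s+ - \s+ (?&day) )
    )
/xi';

/**
 * Returns a list of [type, matched text] pairs found in $text.
 */
function classify_dates(string $text): array
{
    preg_match_all(DATE_REGEX, $text, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        foreach (['month_day', 'range'] as $type) {
            // Groups that did not take part in the match are empty or absent.
            if (($m[$type] ?? '') !== '') {
                $found[] = [$type, $m[$type]];
            }
        }
    }
    return $found;
}

print_r(classify_dates('Feb 14th and March 9th - 11th'));
```

Without the lookahead, the month_day alternative would claim "March 9th" and the range would be lost. With it, "Feb 14th" should be reported as month_day and "March 9th - 11th" as range.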

Another solution would be to build small regexes in separate string variables and then concatenate them in PHP into a bigger regex.
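A rough sketch of that concatenation approach follows. Everything in it is an assumption made for illustration: the variable names, the format keys, the full month list and the helper find_date_strings are not taken from the question. The formats are ordered from most to least specific so that a short format cannot win on a prefix of a longer one.

```php
<?php
// Sketch only: all variable, key and function names here are hypothetical.
$day   = '[0-9]{1,2}(?:st|nd|rd|th)?';
$month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
       . '|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)';
$year  = '20[0-9]{2}';

// Each format is assembled from the small pieces above.
$formats = [
    'range'          => $month . '\s+' . $day . '\s*-\s*' . $day, // March 9th - 11th
    'day_month_year' => $day . '\s+' . $month . '\s+' . $year,    // 2nd February 2013
    'day_of_month'   => $day . '\s+of\s+' . $month,               // 15th of February
    'month_day'      => $month . '\s+' . $day,                    // Feb 14th
];

/**
 * Joins the formats into one regex with a named group per format and
 * returns each match together with the name of the format that matched.
 */
function find_date_strings(string $text, array $formats): array
{
    $alternatives = [];
    foreach ($formats as $name => $body) {
        $alternatives[] = '(?<' . $name . '>' . $body . ')';
    }
    $regex = '/\b(?:' . implode('|', $alternatives) . ')\b/i';

    preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        foreach (array_keys($formats) as $name) {
            // Only the group of the alternative that matched is non-empty.
            if (($m[$name] ?? '') !== '') {
                $found[] = ['type' => $name, 'text' => $m[$name]];
                break;
            }
        }
    }
    return $found;
}

print_r(find_date_strings('Hosting a craft workshop March 9th - 11th in the old hall', $formats));
```

Adding a new format then means adding one entry to $formats rather than editing a long pattern by hand. The matched strings still need to be turned into timestamps, for example with strtotime, which this sketch does not attempt.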

answered Oct 23 '22 at 05:10 by Stephan