Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching dates with regular expressions in Python?

I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.

Using Python 3.2.2, I need to match "Month, Day, Year" with the Month being a string, Day being two digits not over 30, 31, or 28 for February and 29 for February on a leap year. (Basically a REAL and Valid date)

This is what I have so far:

pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
expression = re.compile(pattern)
matches = expression.findall(sampleTextFile)

I'm still not too familiar with regex syntax so I may have characters in there that are unnecessary (the [,][ ] for the comma and spaces feels like the wrong way to go about it), but when I try to match "January, 26, 1991" in my sample text file, the printing out of the items in "matches" is ('January', '26', '1991', '19').

Why does the extra '19' appear at the end?

Also, what things could I add to or change in my regex that would allow me to validate dates properly? My plan right now is to accept nearly all dates and weed them out later using high level constructs by comparing the day grouping with the month and year grouping to see if the day should be <31,30,29,28

Any help would be much appreciated including constructive criticism on how I am going about designing my regex.

like image 707
ahabos Avatar asked Apr 25 '12 03:04

ahabos


People also ask

What is match () function in Python?

match() both are functions of re module in python. These functions are very efficient and fast for searching in strings. The function searches for some substring in a string and returns a match object if found, else it returns none.


1 Answers

Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):

years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years

thirties = pattern % (
     "September|April|June|November",
     r'0?[1-9]|[12]\d|30')

thirtyones = pattern % (
     "January|March|May|July|August|October|December",
     r'0?[1-9]|[12]\d|3[01]')

fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))

feb = r'(February) +(?:%s|%s)' % (
     r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years, # 1-28 any year
     r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours)  # 29 leap years only

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print result

Then we have:

>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False

And what is this glorious regexp, you may ask?

>>> print result
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))

(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)

like image 51
Danica Avatar answered Nov 03 '22 13:11

Danica