
Regex for extracting all complex dates formats from a string in python

Tags: python, date, regex

I have the following string:

 dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"

I want to extract every date in it using a regex. As a first attempt I wrote the following regex:

import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'

re.findall(regEx, dateEntries)

I expected this to work, but it only returns a subset of the dates:

A = ['Mar 20, 2009',
 'March 20, 2009',
 'Mar. 20, 2009',
 'Mar 20 2009',
 '20 Mar 2009',
 '20 March 2009',
 '2 Mar. 2009',
 '20 March, 2009',
 'Mar 20th, 2009',
 'Mar 21st, 2009',
 'Mar 22nd, 2009',
 'Feb 2009',
 'Sep 2009',
 'Oct 2010']

I don't understand why it is not returning these dates:

B = ['04-20-2009', '04/20/09', '4/20/09', '4/3/09', '6/2008', '12/2009', '2009', '2010']

I created regEx by extending r'(?:\d{1,2}[-\s\/])?(?:\d{1,2}[-\/\s])?(?:\d{2,4})', which works well for set B. But regEx is not able to produce A + B.

Can anyone help me build a regex that extracts all the dates in my dateEntries?

NOTE: I want to solve this using regex only.

Asked by Amit Sharma on Jan 02 '23 at 04:01


2 Answers

You are only missing a single ? after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) group to make the month optional. I also added a + after the last two groups so that the regex doesn't split a date like "20 March 2009" into two separate matches.

The full code:

import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)

If a date is preceded by whitespace, the match will include that leading whitespace as well. If you go on to use the date strings, you can remove it with, for example, the .strip() method.
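
As a minimal sketch of that clean-up step, the matches can be stripped in a list comprehension. The pattern is copied from the code above; the names reg_ex, raw and cleaned are illustrative and not part of the original answer.

```python
import re

# Pattern copied unchanged from the answer above.
reg_ex = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

date_entries = (
    "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; "
    "Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; "
    "20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; "
    "Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
)

# findall returns whole matches here because every group is non-capturing.
raw = re.findall(reg_ex, date_entries)

# Remove any whitespace the pattern picked up around each match.
cleaned = [match.strip() for match in raw]
print(cleaned)
```

Stripping after matching leaves the pattern itself untouched, which keeps this change separate from any later tightening of the regex.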

Answered by Nils Schlüter on Jan 05 '23 at 04:01


Your regex pattern is very hard to read. Please build it from simple, named building blocks instead; that makes the code a lot more readable:

import re
import calendar

full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)

sep = r'[.,]?\s+'               # separator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'

r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)          # dateEntries is the string from the question
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
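
The pattern above only covers dates that contain a month name. As a hedged sketch, not part of the original answer, the same building-block approach might be extended with numeric alternatives so that a single pattern returns sets A and B together. The names num, year4, alternatives and date_re are assumptions introduced for illustration, and calendar.month_name is assumed to yield English names (the default locale). The ordinal suffixes are written as a group, (?:st|nd|rd|th), because inside square brackets | is a literal character rather than alternation.

```python
import calendar
import re

date_entries = (
    "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; "
    "Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; "
    "20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; "
    "Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
)

# Month names follow the current locale; English is assumed here.
full_months = [m for m in calendar.month_name if m]
short_months = [m[:3] for m in full_months]
# Full names come first so that "March" is tried before "Mar".
month = '(?:' + '|'.join(full_months + short_months) + r')\.?'

sep = r',?\s+'                      # optional comma, then whitespace
num = r'\d{1,2}'                    # numeric day or month
day = num + r'(?:st|nd|rd|th)?'     # day with optional ordinal suffix
year4 = r'\d{4}'                    # four-digit year
year = r'(?:\d{4}|\d{2})'           # four-digit or two-digit year

# Ordered from most specific to least specific.
alternatives = [
    rf'{month}{sep}{day}{sep}{year4}',   # Mar 20th, 2009
    rf'{day}{sep}{month}{sep}{year4}',   # 20 March, 2009
    rf'{month}{sep}{year4}',             # Feb 2009
    rf'{num}[-/]{num}[-/]{year}',        # 04-20-2009, 4/3/09
    rf'{num}/{year4}',                   # 6/2008
    year4,                               # 2009
]
date_re = re.compile(r'\b(?:' + '|'.join(alternatives) + r')\b')

found = date_re.findall(date_entries)
print(found)
```

The quantified pieces such as \d{1,2} are kept in plain raw strings and only composed through rf-strings, since braces inside an f-string would otherwise be read as replacement fields. Because the alternatives are tried in order, a bare year is only matched when none of the longer forms match at that position.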

Answered by Sunitha on Jan 05 '23 at 04:01