I have following string:
dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
Here I want to extract all mentioned dates using regex
. As an attempt I have written following regex
:
import re
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'
re.findall(regEx, dateEntries)
I was expecting this to work but it only return subset of dates.
A = ['Mar 20, 2009',
'March 20, 2009',
'Mar. 20, 2009',
'Mar 20 2009',
'20 Mar 2009',
'20 March 2009',
'2 Mar. 2009',
'20 March, 2009',
'Mar 20th, 2009',
'Mar 21st, 2009',
'Mar 22nd, 2009',
'Feb 2009',
'Sep 2009',
'Oct 2010']
I'm not getting why its not returning the dates:
B=[04-20-2009; 04/20/09; 4/20/09; 4/3/09; 6/2008; 12/2009; 2009; 2010"]
I created the regEx
by extending the r'(?:\d{1,2}[-\s\/])?(?:\d{1,2}[-\/\s])?(?:\d{2,4})'
which works good for set B. But regEx
is not able to produce A+B
Can anyone help in making a regex for extracting all dates mentioned in my dateEntries
?
NOTE: I want to solve this using regex only.
You are just missing a single ?
after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
group to mark it as not necessary. Additionally I added a +
behind the last two groups to make sure the regex doesn't split dates like "20 March 2009" into two different dates.
The full code:
import re
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'
dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)
If your date has leading whitespaces, the result will also have leading whitespaces. If you continue using the date string you could remove them for example with the .strip()
method
Your regex pattern is totally unreadable.. Please build your regex pattern with simple building blocks. That would make the code a lot more readable
import re
import calendar
full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)
sep = r'[.,]?\s+' # seperator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'
r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With