 

regex pattern to match datetime in python

I have a string that contains datetimes, and I am trying to split it at each datetime occurrence:

data="2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"

What I am doing:

out=re.split('^(2[0-3]|[01]?[0-9]):([0-5]?[0-9]):([0-5]?[0-9])$',data)

What I get:

["2018-03-14 06:08:18, he went on 2018-03-15 06:08:18, lets play"]

What I want:

["2018-03-14 06:08:18, he went on","2018-03-15 06:08:18, lets play"]
Vicky asked Mar 05 '23 19:03


2 Answers

You want to split on one or more whitespace characters that are followed by a date-like pattern, so you may use

re.split(r'\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)', s)

See the regex demo

Details

  • \s+ - 1+ whitespace chars
  • (?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b) - a positive lookahead that makes sure that, immediately to the right of the current location, there are
    • \d{2}(?:\d{2})? - 2 or 4 digits
    • - - a hyphen
    • \d{1,2} - 1 or 2 digits
    • -\d{1,2} - again a hyphen and 1 or 2 digits
    • \b - a word boundary (if not necessary, remove it, or replace with (?!\d) in case you may have dates glued to letters or other text)

Python demo:

import re
rex = r"\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)"
s = "2018-03-14 06:08:18, he went on 2018-03-15 06:08:18, lets play"
print(re.split(rex, s))
# => ['2018-03-14 06:08:18, he went on', '2018-03-15 06:08:18, lets play']
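The demo string above has no newline, unlike the string in the question. As a quick check, the sketch below runs both patterns on the question's original data. The question's pattern is anchored with ^ and $ and describes only a time, so it can match only if the whole string is a single time; re.split then returns the input unchanged. The variable names unsplit and parts are chosen here for illustration only.

```python
import re

data = "2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"

# Pattern from the question: ^ and $ anchor it to the whole string,
# and it only describes HH:MM:SS, so nothing matches and nothing is split.
original = r'^(2[0-3]|[01]?[0-9]):([0-5]?[0-9]):([0-5]?[0-9])$'
unsplit = re.split(original, data)
print(unsplit == [data])

# Pattern from this answer: \s+ consumes the " \n" run before the second date.
rex = r"\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)"
parts = re.split(rex, data)
print(parts)
```

Because \s also matches newlines, the space and the line break before the second date are removed together.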

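NOTE: If there may be no whitespace before the date, in Python 3.7 and newer you may use r"\s*(?<!\d)(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)" (note the * quantifier in \s*, which allows zero-length matches; the (?<!\d) lookbehind keeps such a match from firing in the middle of a four-digit year). Zero-length matches can leave empty strings in the result, for example when the string starts with a date, so filter them out if needed. For older versions, you will need to use a solution like the one @blhsing suggests, or install the PyPI regex module and use r"(?V1)\s*(?<!\d)(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)" with regex.split.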
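A minimal sketch of the zero-length variant, assuming Python 3.7 or newer and an input where the second date is glued to the preceding text. The sample string is made up for illustration. The (?<!\d) lookbehind is included so that the empty match cannot fire inside a four-digit year (where "18-03-14" would otherwise look like a two-digit-year date), and the list comprehension drops the empty strings that zero-length matches produce.

```python
import re

# Assumed sample: no whitespace before the second date.
s = "2018-03-14 06:08:18, he went on2018-03-15 06:08:18, lets play"

# \s* may match nothing at all; (?<!\d) rejects positions preceded by a digit.
rex = r"\s*(?<!\d)(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)"

# The split at position 0 yields a leading empty string, so filter it out.
parts = [p for p in re.split(rex, s) if p]
print(parts)
```

The same filtering also covers inputs that do have whitespace before a date, where an extra empty match can follow the consumed whitespace.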

Wiktor Stribiżew answered Mar 09 '23 00:03


re.split is meant for cases where you have a certain delimiter pattern. Use re.findall with a lookahead pattern instead:

import re
data="2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"
d = r'\d{4}-\d?\d-\d?\d (?:2[0-3]|[01]?[0-9]):[0-5]?[0-9]:[0-5]?[0-9]'
print(re.findall(r'{0}.*?(?=\s*{0}|$)'.format(d), data, re.DOTALL))

This outputs:

['2018-03-14 06:08:18, he went on', '2018-03-15 06:08:18, lets play']
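The same idea can be wrapped in a small function and tried on more than two records. This is only a sketch: split_records and the three-record sample string are hypothetical names and data introduced here, while the timestamp pattern is the one from the answer.

```python
import re

# Timestamp pattern taken from the answer above.
STAMP = r'\d{4}-\d?\d-\d?\d (?:2[0-3]|[01]?[0-9]):[0-5]?[0-9]:[0-5]?[0-9]'

# A record is a timestamp plus everything up to the next timestamp or the end.
RECORD = re.compile(r'{0}.*?(?=\s*{0}|$)'.format(STAMP), re.DOTALL)

def split_records(text):
    """Hypothetical helper: return each timestamp-led record as a string."""
    return RECORD.findall(text)

log = "2018-03-14 06:08:18, a \n2018-03-15 06:08:18, b\n2018-03-16 23:59:59, c"
records = split_records(log)
print(records)
```

Note that with findall any text before the first timestamp is not returned, since every match has to start with a timestamp.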
blhsing answered Mar 08 '23 23:03