Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python split at tag regex

Tags:

python

regex

I'm trying to split these lines:

<label>Olympic Games</label>
<title>Next stop</title>

Into:

["<label>", "Olympic Games", "</label>"]
["<title>", "Next stop", "</title>"]

In Python I can use regular expressions but what I've made doesn't do anything:

line.split("<\*>")
like image 383
Steven Avatar asked Jan 04 '23 11:01

Steven


2 Answers

Using lookarounds and a capture group to keep the text after splitting:

re.split(r'(?<=>)(.+?)(?=<)', '<label>Olympic Games</label>')
like image 123
Aran-Fey Avatar answered Jan 13 '23 11:01

Aran-Fey


This regex works for me:

<(label|title)>([^<]*)</(label|title)>

or, as cwallenpoole suggested:

<(label|title)>([^<]*)</(\1)>

enter image description here

I've used http://www.regexpal.com/

I have used three capturing groups, if you don't need them, simply remove the ()

What is wrong about your regex <\*> is that is matching only one thing: <*>. You have scaped * using \*, so what you are saying is:

  • Match any text with <, then a * and then a >.
like image 30
Alejandro Alcalde Avatar answered Jan 13 '23 12:01

Alejandro Alcalde