
Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:

<p><span class=love><p>miracle</p>...</span></p><br>love</br>

And I want to remove the string:

<span class=love><p>miracle</p>...</span>

and possibly some other HTML tags as well, while preserving the remaining tags and their contents.

The result should be like this:

<p></p><br>love</br>

How can I do this with a regex pattern? This is what I have tried:

r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)

but it leaves the

</span>

behind. Can you help me do this with the re module this time? I will learn an HTML parser next.

mjc asked Oct 11 '25 19:10


1 Answer

First things first: don’t parse HTML using regular expressions.

That being said, if there is no additional span tag within that span tag, then you could do it like this:

import re
text = re.sub(r'<span class=love>.*?</span>', '', text)
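
As a rough, runnable sketch of that substitution on the string from the question: the names LOVE_SPAN, s, nested and the two result variables are only illustrative, and re.DOTALL is an assumption that is not part of the one-liner above (it lets the dot match newlines in case the span content spans several lines). The second half shows the nested-span limitation mentioned above.

```python
import re

# Non-greedy match from the opening span up to the first closing span.
# re.DOTALL is an assumption beyond the original answer: it lets "." also
# match newlines, in case the span content is spread over several lines.
LOVE_SPAN = re.compile(r'<span class=love>.*?</span>', re.DOTALL)

# The sample string from the question.
s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'
result = LOVE_SPAN.sub('', s)
print(result)  # <p></p><br>love</br>

# Limitation: with a nested span, the match ends at the inner </span>.
nested = '<span class=love>a<span>b</span>c</span>d'
nested_result = LOVE_SPAN.sub('', nested)
print(nested_result)  # c</span>d
```

The nested case is why this only works when there is no other span inside the one being removed: a regex like this cannot keep track of which closing tag belongs to which opening tag.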

On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).


The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) only asserts that a closing span follows; it never consumes what it looks ahead for. So the match stops immediately before the closing span tag, which is therefore left in the result.

You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that’s not really necessary: the .*? is a non-greedy expression, so it tries to match as little as possible. In .*?</span>, the .*? only matches up to the first closing span, where it stops immediately.
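
To make the difference concrete, here is a small sketch comparing the three patterns discussed above on the string from the question. The variable names are only illustrative.

```python
import re

# The sample string from the question.
s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Lookahead only: </span> is checked for but not consumed, so it stays behind.
with_lookahead = re.sub(r'<span class=love>.*?(?=</span>)', '', s)
print(with_lookahead)  # <p></span></p><br>love</br>

# Lookahead followed by an explicit </span>: works, but the lookahead is redundant.
lookahead_plus_close = re.sub(r'<span class=love>.*?(?=</span>)</span>', '', s)
print(lookahead_plus_close)  # <p></p><br>love</br>

# Plain non-greedy match that consumes the closing tag directly.
plain = re.sub(r'<span class=love>.*?</span>', '', s)
print(plain)  # <p></p><br>love</br>
```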

poke answered Oct 14 '25 11:10