
Python: remove everything between <div class="comment"> ... </div>

Tags: python, html, class

How do you use Python 2.6 to remove a <div class="comment"> ... </div> element, including the tags themselves and everything between them?

I tried various approaches using re.sub without any success.

Thank you

Michelle Jun Lee asked Apr 15 '10 23:04


1 Answer

This can be done easily and reliably using an HTML parser like BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<body><div>1</div><div class="comment"><strong>2</strong></div></body>')
>>> for div in soup.findAll('div', 'comment'):
...   div.extract()
... 
<div class="comment"><strong>2</strong></div>
>>> soup
<body><div>1</div></body>
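
If BeautifulSoup is not installed, the same parser-based idea can be sketched with only the standard library. Note the assumptions: this is written for Python 3 (html.parser), not the Python 2.6 from the question, and the names CommentDivStripper and strip_comment_divs are made up for illustration. It is a rough sketch, not a replacement for a real HTML parser: it drops HTML comments and doctype declarations, and it expects the div tags to be balanced.

```python
from html.parser import HTMLParser


class CommentDivStripper(HTMLParser):
    """Re-emit the HTML it is fed, skipping each <div class="comment"> subtree."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # number of open <div>s inside the div being removed

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside the removed subtree
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "comment" in classes:
            self.skip_depth = 1  # start skipping at this div
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the div depth.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1  # reaches 0 at the matching </div>
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)


def strip_comment_divs(html):
    """Return html with every <div class="comment"> element removed."""
    parser = CommentDivStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


result = strip_comment_divs(
    '<body><div>1</div><div class="comment"><strong>2</strong></div></body>'
)
print(result)
```

Counting only div tags while skipping means unclosed elements like <p> or <br> inside the comment do not throw off the depth, and nested divs inside the comment are removed along with it.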

See this question for examples of why parsing HTML using regular expressions is a bad idea.
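
As a small illustration of the problem (a sketch in Python 3; the sample strings are invented for the demo), here is how the two obvious re.sub patterns go wrong once divs are nested or more than one div appears in the document:

```python
import re

# A comment div that contains a nested div.
nested = '<div class="comment"><div>inner</div>tail</div><p>keep</p>'

# Non-greedy match: stops at the FIRST </div>, so part of the comment survives.
lazy = re.sub(r'<div class="comment">.*?</div>', '', nested, flags=re.S)
print(lazy)

# A comment div followed by unrelated content that should be kept.
siblings = '<div class="comment">a</div><p>keep</p><div>also keep</div>'

# Greedy match: runs to the LAST </div>, deleting the unrelated content too.
greedy = re.sub(r'<div class="comment">.*</div>', '', siblings, flags=re.S)
print(repr(greedy))
```

Neither pattern can know which </div> closes the comment div, because matching nested tags requires counting, which is what a parser does for you.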

Ayman Hourieh answered Oct 15 '22 18:10