
Regex to remove conditional comments

Tags: python, regex

I want a regex that matches conditional comments in an HTML source page so I can remove only those. I want to preserve the regular comments.

I would also like to avoid using the .*? notation if possible.

The text is

foo

<!--[if IE]>

<style type="text/css">

ul.menu ul li{
    font-size: 10px;
    font-weight:normal;
    padding-top:0px;
}

</style>

<![endif]-->

bar

and I want to remove everything between <!--[if IE]> and <![endif]-->.

EDIT: I want to remove these comments because of BeautifulSoup: it fails to parse the page and returns an incomplete source.

EDIT2: [if IE] isn't the only condition. There are many more, and I don't have a list of all possible combinations.

EDIT3: Vinko Vrsalovic's solution works, but the actual reason BeautifulSoup failed was a rogue comment inside the conditional comment, like this:

<!--[if lt IE 7.]>
<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->
<![endif]-->

Notice the <!--png fix for IE--> comment?

Though my problem is solved, I would still love to get a regex solution for this.

asked Sep 25 '08 10:09 by cnu


1 Answer

>>> from BeautifulSoup import BeautifulSoup, Comment
>>> html = '<html><!--[if IE]> bloo blee<![endif]--></html>'
>>> soup = BeautifulSoup(html)
>>> comments = soup.findAll(text=lambda text: isinstance(text, Comment)
...                         and text.find('if') != -1)
>>> [comment.extract() for comment in comments]
[u'[if IE]> bloo blee<![endif]']
>>> print soup.prettify()
<html>
</html>

Python 3 with bs4:

from bs4 import BeautifulSoup, Comment

html = '<html><!--[if IE]> bloo blee<![endif]--></html>'
soup = BeautifulSoup(html, "html.parser")
# Select only the comments whose text contains 'if'
comments = soup.find_all(string=lambda text: isinstance(text, Comment)
                         and 'if' in text)
# Detach each matching comment from the tree
for comment in comments:
    comment.extract()
print(soup.prettify())

If your data confuses BeautifulSoup, you can fix it beforehand or customize the parser, among other solutions.
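
One way to fix the data beforehand, which also touches on the regex request in the question, is to strip the conditional comments from the raw string before parsing. The following is only a sketch, not a tested production pattern. It assumes the comments always have the form <!--[if ...]> ... <![endif]--> and are never nested inside each other; the names CONDITIONAL_COMMENT and strip_conditional_comments are made up for this example. Instead of .*? it steps forward one character at a time while a negative lookahead checks that the closing marker has not started yet, so a rogue inner comment such as <!--png fix for IE--> does not end the match early.

```python
import re

# Hypothetical pattern for illustration: handles only the
# <!--[if ...]> ... <![endif]--> form, without nesting.
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if[^\]]*\]>"          # opening marker with any condition
    r"(?:(?!<!\[endif\]-->).)*"   # any character not starting the closing marker
    r"<!\[endif\]-->",            # closing marker
    re.DOTALL | re.IGNORECASE,
)


def strip_conditional_comments(html):
    """Return html with every conditional comment removed."""
    return CONDITIONAL_COMMENT.sub("", html)


sample = """foo
<!--[if lt IE 7.]>
<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->
<![endif]-->
<!-- a regular comment -->
bar"""

cleaned = strip_conditional_comments(sample)
print(cleaned)
```

The cleaned string can then be handed to BeautifulSoup as usual. Regular comments are left alone because they do not start with <!--[if.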

EDIT: Per your comment, just modify the lambda passed to findAll as needed (I have updated it above).
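
As an example of such a modification: checking for 'if' anywhere in the comment also catches ordinary comments that happen to contain the word. A stricter test could look only at the start of the comment text. This is a rough sketch; the helper name looks_like_conditional is hypothetical, and it assumes the comment body begins with [if ...]> as in the examples in the question.

```python
import re

# Hypothetical helper: a comment body counts as conditional when it
# starts with "[if <condition>]>" (leading whitespace is ignored).
_OPENING = re.compile(r"\[if\s[^\]]*\]>", re.IGNORECASE)


def looks_like_conditional(comment_text):
    """Return True if the comment body starts like a conditional comment."""
    return bool(_OPENING.match(comment_text.lstrip()))


print(looks_like_conditional("[if IE]> bloo blee<![endif]"))
print(looks_like_conditional(" what if this fails? "))
```

With bs4 the lambda would then become something like lambda text: isinstance(text, Comment) and looks_like_conditional(text).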

answered Oct 17 '22 06:10 by Vinko Vrsalovic