I am trying to clean up an HTML table using lxml.html.clean.Cleaner(). I need to strip JavaScript attributes, but would like to preserve inline CSS style. I thought style=False is the default setup:
import lxml.html.clean
cleaner = lxml.html.clean.Cleaner()
However, when I call cleaner.clean_html(doc),
<span style="color:#008800;">67.51</span>
becomes
<span>67.51</span>
In other words, the style attribute is not preserved. I tried adding:
cleaner.style = False
but it doesn't help.
Update: I am using Python 2.6.6 + lxml 3.2.4 on Dreamhost, and Python 2.7.5 + lxml 3.2.4 on my local MacBook. Same results. Another thing: there is a JavaScript-related attribute in my HTML:
<td style="cursor:pointer;">Ticker</td>
Could it be that lxml stripped this JavaScript-related style and then treated all other styles the same way? I hope not.
It works if you set cleaner.safe_attrs_only = False.
The set of "safe" attributes (Cleaner.safe_attrs) is defined in the lxml.html.defs module (source code), and style is not included in that set.
But even better than cleaner.safe_attrs_only = False is to use Cleaner(safe_attrs=lxml.html.defs.safe_attrs | set(['style'])). This preserves style while still stripping every other attribute that is not in the safe set.
Demo code:
from lxml import html
from lxml.html import clean
s = '<marquee><span style="color: #008800;">67.51</span></marquee>'
doc = html.fromstring(s)
cleaner = clean.Cleaner(safe_attrs=html.defs.safe_attrs | set(['style']))
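# print() with one argument works on Python 2 and 3; on Python 3 the result is shown as a bytes literal (b'...')
print(html.tostring(cleaner.clean_html(doc)))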
Output:
<div><span style="color: #008800;">67.51</span></div>
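For intuition about why extending the allowlist is preferable to switching it off, here is a toy model of attribute filtering in plain Python. It is only a sketch and not lxml's actual implementation: filter_attrs and BASE_SAFE_ATTRS are hypothetical names, and BASE_SAFE_ATTRS is a tiny stand-in for the much larger lxml.html.defs.safe_attrs set.

```python
# Toy model of an attribute allowlist. This is NOT lxml's implementation.
# BASE_SAFE_ATTRS is a small hypothetical stand-in for lxml.html.defs.safe_attrs.
BASE_SAFE_ATTRS = frozenset(['class', 'href', 'id', 'title'])


def filter_attrs(attrs, safe_attrs_only=True, safe_attrs=BASE_SAFE_ATTRS):
    """Return a copy of attrs, dropping names outside the allowlist."""
    if not safe_attrs_only:
        # Allowlist disabled: every attribute passes this filter.
        return dict(attrs)
    return dict((name, value) for name, value in attrs.items()
                if name in safe_attrs)


attrs = {'style': 'color:#008800;', 'class': 'price', 'data-x': '1'}

# Default allowlist: style is dropped along with the unknown attribute.
default_result = filter_attrs(attrs)

# Allowlist disabled: style survives, but so does everything else.
disabled_result = filter_attrs(attrs, safe_attrs_only=False)

# Allowlist extended with 'style': style survives, unknown attributes do not.
extended_result = filter_attrs(attrs,
                               safe_attrs=BASE_SAFE_ATTRS | set(['style']))

print(default_result)
print(disabled_result)
print(extended_result)
```

The third call mirrors the safe_attrs=... | set(['style']) approach from the answer: only the one extra attribute is let through, and the rest of the filter stays in force.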