How do I remove class attributes from html using python and lxml?
I have:
<p class="DumbClass">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
I want:
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
I've looked at lxml.html.clean.Cleaner; however, it does not have a method to strip out class attributes. You can set safe_attrs_only=True,
but this does not remove the class attribute.
Significant searching has turned up nothing workable. I think the fact that class is used in both HTML and Python further muddies search results. Many of the results also seem to deal strictly with XML.
I'm open to other python modules that offer humane interfaces as well.
Thanks much.
Thanks to @Dan Roberts's answer below, I came up with the following solution. It is presented here for folks arriving in the future trying to solve the same problem.
import lxml.html
# Our html string we want to remove the class attribute from
html_string = '<p class="DumbClass">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>'
# Parse the html
html = lxml.html.fromstring(html_string)
# Print out our "Before"
print(lxml.html.tostring(html))
# .xpath below gives us a list of all elements that have a class attribute
# xpath syntax explained:
# // = select all tags that match our expression regardless of location in doc
# * = match any tag
# [@class] = keep only the tags that have a class attribute
for tag in html.xpath('//*[@class]'):
    # For each element with a class attribute, remove that class attribute
    tag.attrib.pop('class')
# Print out our "After"
print(lxml.html.tostring(html))
I can't test this at the moment, but this appears to be the general idea:
for tag in node.xpath('//*[@class]'):
    tag.attrib.pop('class')
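lxml.html.clean.Cleaner does work, but it needs proper configuration. The class attribute is on the Cleaner's default list of safe attributes, so setting safe_attrs_only=True alone leaves it in place; you also have to supply your own safe_attrs list without it.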
from lxml import html
from lxml.html import clean
html_string = '<p id="test" class="DumbClass">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>'
tree = html.fromstring(html_string)
# Keep only the attributes listed in safe_attrs; everything else, including class, is dropped
cleaner = clean.Cleaner()
cleaner.safe_attrs_only = True
cleaner.safe_attrs = frozenset(['id'])
cleaned = cleaner.clean_html(tree)
print(html.tostring(cleaned))
Result:
b'<p id="test">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>'
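A whitelist containing only id also throws away attributes such as href or title. If the goal is just to drop class, one option is to start from lxml's default list in lxml.html.defs.safe_attrs and subtract class from it. This is a sketch under assumptions: the title attribute in the sample string is invented for illustration, and on recent lxml releases I believe lxml.html.clean additionally requires the separate lxml_html_clean package to be installed.

```python
from lxml.html import defs
from lxml.html.clean import Cleaner

# Sample input; the title attribute is an illustrative addition
html_string = '<p id="test" class="DumbClass" title="greeting">Lorem ipsum</p>'

# Start from lxml's default safe list and remove only "class"
allowed = defs.safe_attrs - {'class'}
cleaner = Cleaner(safe_attrs_only=True, safe_attrs=allowed)

# Passing a string to clean_html returns a string
result = cleaner.clean_html(html_string)
print(result)
```

Keep in mind that the Cleaner also applies its other default filters (scripts, embedded content, and so on), so it does more than strip attributes.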