I'm doing some HTML cleaning with BeautifulSoup, and I'm new to both Python and BeautifulSoup. I've got tags being removed correctly as follows, based on an answer I found elsewhere on Stack Overflow:
[s.extract() for s in soup('script')]
But how do I remove inline styles? For instance, the following:
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
Should become:
<p>Text</p>
<img href="somewhere.com">
How do I delete the class, id, name and style attributes of all elements?
Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
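For reference, here is a self-contained sketch of the same idea applied to the markup from the question. It assumes BeautifulSoup 4 (the `bs4` package) and the built-in `html.parser`; the `UNWANTED` name is just mine. I've used `tag.attrs.pop(...)` rather than `del` so that tags lacking one of the attributes are plainly a no-op.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)
soup = BeautifulSoup(html, "html.parser")

# Attributes to strip from every element (illustrative name).
UNWANTED = ("class", "id", "name", "style")

for tag in soup.find_all(True):  # True matches every tag in the document
    for attribute in UNWANTED:
        # tag.attrs is a plain dict, so pop() with a default
        # does nothing when the attribute is absent.
        tag.attrs.pop(attribute, None)

cleaned = str(soup)
print(cleaned)
```

Attributes outside the list, such as `href` on the `img`, are left alone.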
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
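To make the difference concrete, here is a small sketch (again assuming BeautifulSoup 4 and `html.parser`; the sample markup is made up). `extract()` hands the removed tag back to you, while `decompose()` just destroys it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><script>alert(1)</script><p>Keep me</p><style>p{}</style></div>",
    "html.parser",
)

# extract() detaches the tag from the tree and returns it,
# so it can still be inspected or reinserted elsewhere.
removed = soup.script.extract()

# decompose() detaches the tag and destroys it; use it when
# the removed element is of no further interest.
for tag in soup.find_all("style"):
    tag.decompose()

remaining = str(soup)
print(removed.name)
print(remaining)
```

A plain `for` loop is used here instead of a list comprehension, since the calls are made only for their side effects.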
I wouldn't do this in BeautifulSoup: you'll spend a lot of time trying, testing, and working around edge cases.
Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
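A rough sketch of what the whitelist approach could look like in BeautifulSoup 4 follows. The `ALLOWED` mapping and the `clean` function are hypothetical names for illustration, not part of any library, and unwrapping disallowed tags (rather than deleting them with their contents) is just one possible policy.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist: tag name -> attributes that tag may keep.
ALLOWED = {
    "p": set(),
    "a": {"href"},
    "img": {"src", "alt"},
}


def clean(html):
    """Keep only whitelisted tags and attributes (illustrative policy)."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            # Remove the tag itself but keep its children in place.
            tag.unwrap()
            continue
        allowed_attrs = ALLOWED[tag.name]
        # Rebuild the attribute dict with only the permitted keys.
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed_attrs}
    return str(soup)


result = clean(
    '<div id="x"><p class="a" style="color:red">Hi '
    '<a href="/u" onclick="x()">there</a></p></div>'
)
print(result)
```

Note that `unwrap()` keeps the text inside a disallowed tag, so elements like `script` or `style` would probably need to be removed with `decompose()` first. This is also only attribute and tag filtering; it does not check URL schemes or other things a dedicated sanitizer handles.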