I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there any third-party modules I could install?
Try the lxml.html.clean.Cleaner module. It requires lxml — pip install lxml (it's a native module written in C, so it may be faster than pure-Python solutions). Check out the docs for the full list of options you can pass to the Cleaner.
How can it remove tags (e.g. a div with a specific 'id' or 'class') completely, including their text?
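A minimal sketch of basic Cleaner usage, assuming the markup arrives as a string; the keyword options and the kill_tags parameter are standard Cleaner arguments, but the sample HTML here is made up for illustration:

from lxml.html.clean import Cleaner

# Illustrative broken markup: unclosed <p>, unquoted attribute, embedded script
bad_html = "<p>Unclosed paragraph<div class=broken>text</div><script>evil()</script>"

# Strip scripts, inline JS, comments and styles; page_structure=False keeps the fragment as a fragment
cleaner = Cleaner(scripts=True, javascript=True, comments=True, style=True, page_structure=False)
good_html = cleaner.clean_html(bad_html)

# kill_tags drops the named tags together with their text content,
# so a blanket "remove every div, text and all" looks like this:
killer = Cleaner(kill_tags=["div"], page_structure=False)
no_divs = killer.clean_html(bad_html)

Note that Cleaner itself only targets tag names; to drop a specific element by id or class you would normally locate it with lxml's xpath() (or cssselect()) and call drop_tree() on the matched element.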
I would suggest BeautifulSoup. It has a wonderful parser that deals with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.
from bs4 import BeautifulSoup

# "html.parser" is built in; "html5lib" is even more forgiving of broken markup
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
I've used this many times and it works wonders. And if you simply need to pull data out of bad HTML, BeautifulSoup really shines there too.
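For example, a quick sketch of that kind of extraction (the sample markup and URLs are just illustrative):

from bs4 import BeautifulSoup

bad_html = "<ul><li><a href='/a'>First<li><a href='/b'>Second"  # missing closing tags
soup = BeautifulSoup(bad_html, "html.parser")
links = [a["href"] for a in soup.find_all("a")]  # both hrefs are recovered despite the unclosed tags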