
Clean Up HTML in Python

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM, such as missing closing tags or malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there any third-party modules I could install?

Joel asked Jun 19 '10 00:06





1 Answer

I would suggest BeautifulSoup. Its parser deals with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.

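from bs4 import BeautifulSoup

# bad_html is the malformed markup you fetched, as a string
tree = BeautifulSoup(bad_html, "html.parser")  # name the parser explicitly
good_html = tree.prettify()  # re-serialized with every tag closed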

I've used this many times and it works wonders. If you're simply pulling data out of bad HTML, that is also where BeautifulSoup really shines.
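
To illustrate the data-extraction case, here is a minimal sketch that assumes the bs4 package is installed and uses the standard library's "html.parser" backend. The sample markup and the extract_links helper are made up for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: the <b> tag and the second <a> and <p>
# are never closed.
bad_html = '<p>See <b>the <a href="/docs">docs</a></p><p><a href="/faq">FAQ'


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that carries an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(bad_html)
for text, href in links:
    print(text, href)
```

The parser closes the dangling tags while building the tree, so find_all still sees both links. Other backends such as lxml or html5lib may repair the same input differently, so it is worth checking the output against your own sources.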

JudoWill answered Oct 05 '22 10:10
