 

How to parse malformed HTML in Python, using standard libraries

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

  • Use only Python standard library components (any 2.x version)
  • DOM support
  • Handle HTML entities (e.g. &nbsp;)
  • Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

  • XPATH support
  • Handle unclosed/malformed tags. (<big>does anyone here know <html ???

Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, this isn't exactly robust. Since I produced this by staring at the docs for 15 minutes and writing one line of code, I thought I could consult the stackoverflow community for a similar but better solution...

    from xml.etree.ElementTree import fromstring
    DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
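To spell out where this approach holds up and where it falls over, here's a slightly longer sketch (assuming Python 2.x, as in the requirements; the exception class it raises varies by 2.x version):

    from xml.etree.ElementTree import fromstring

    def parse(fragment):
        # Wrap the fragment so it is one well-formed document, and swap the
        # one named entity expat doesn't know for its numeric form.
        return fromstring("<html>%s</html>" % fragment.replace('&nbsp;', '&#160;'))

    # Partial documents and the substituted entity are fine...
    dom = parse("Hello, <i>World</i>!&nbsp;")
    print(dom.find('i').text)                      # World
    # ...and findall() even gives a small XPath-like subset:
    print([el.tag for el in dom.findall('.//i')])  # ['i']

    # ...but anything that isn't well-formed raises instead of being repaired
    # (ParseError on 2.7, ExpatError on older versions):
    try:
        parse("<big>does anyone here know <html ???")
    except Exception as err:
        print("not robust: %s" % err)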
asked Apr 20 '10 by bukzor



1 Answer

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).
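To make that concrete, here is a minimal sketch of what the standard library actually gives you: HTMLParser hands you a stream of events, and building a tree (and deciding how to repair broken markup) is left entirely to you (module name assumes Python 2.x; in Python 3 it lives in html.parser):

    # Python 2.x module name; in Python 3: from html.parser import HTMLParser
    from HTMLParser import HTMLParser

    class TagLogger(HTMLParser):
        # HTMLParser only reports events -- there is no DOM to query afterwards.
        def handle_starttag(self, tag, attrs):
            print("start %s %r" % (tag, attrs))
        def handle_endtag(self, tag):
            print("end   %s" % tag)
        def handle_data(self, data):
            print("data  %r" % data)

    p = TagLogger()
    p.feed("Hello, <i>World</i>!")          # clean fragments work fine
    p.close()

    p = TagLogger()
    p.feed("<big>does anyone here know ")   # unclosed tag: no error, but no repair either
    p.close()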

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.
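For illustration, a rough sketch of the three of them on the question's malformed fragment, assuming the packages are installed. None of this is standard library, and the package names below are the current ones, which is an assumption on my part (for example, older BeautifulSoup 3 releases import as from BeautifulSoup import BeautifulSoup):

    broken = "<big>does anyone here know <html ???"

    import lxml.html
    doc = lxml.html.document_fromstring(broken)   # repaired into an <html><body> tree, no exception
    big_tags = doc.xpath('//big')                 # real XPath support

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(broken, "html.parser")   # backend can also be "lxml" or "html5lib"
    big_tag = soup.find("big")

    import html5lib
    tree = html5lib.parse(broken)                 # HTML5 parsing algorithm; returns an ElementTree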

answered Sep 19 '22 by Ian Bicking