
Parsing HTML in Python [closed]

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and the module is now deprecated.

I would prefer it if the parser could stomach a bit of malformed HTML, although I'm pretty sure most of the input will be pretty clean.

Andy Baker asked Apr 04 '09 18:04


People also ask

What class does Python provide to parse HTML?

The HTMLParser class defined in this module provides functionality to parse HTML and XHTML documents. This class contains handler methods that can identify tags, data, comments and other HTML elements.
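As a rough illustration of those handler methods, here is a minimal sketch assuming Python 3, where the class lives in `html.parser` (in Python 2 the module was named `HTMLParser`). The `EventLogger` class and `log_events` helper are hypothetical names made up for this example; only the `handle_*` methods, `feed` and `close` come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass: records each parse event as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; tag names arrive lowercased.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        if data.strip():
            self.events.append(("data", data.strip()))

    def handle_comment(self, data):
        # Called with the text inside <!-- ... -->.
        self.events.append(("comment", data.strip()))


def log_events(markup):
    """Feed markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in log_events("<p>Hi<!-- note --></p>"):
        print(event)
```

The parser is event-driven: it builds no tree, so any structure you need has to be tracked in the subclass itself.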

How do you parse a HTML page?

If you just want to parse HTML in the browser with JavaScript, and your HTML is intended for the body of your document, you could do the following: (1) var div = document.createElement("DIV"); (2) div.innerHTML = markup; (3) result = div.childNodes; This gives you a collection of child nodes and should work not just in IE8 but even in IE6-7.


1 Answer

Python has a native HTML parser; however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, is it?).
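For what it's worth, the native parser tends to cope with mildly broken markup. The sketch below assumes Python 3's `html.parser` and pulls `href` values out of input with unclosed tags, uppercase names and an unquoted attribute. `LinkCollector` and `extract_links` are hypothetical names for this example, not part of the standard library, and this is not a claim about how Tidy behaves.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(markup):
    """Return the href values found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Unclosed <p> and <a>, uppercase tag and attribute, unquoted value.
    sloppy = '<p><a href="/one">one<p><A HREF=two.html>two</a>'
    print(extract_links(sloppy))
```

Because the parser only reports events and never checks that tags are balanced, the unclosed elements do not cause an error here; the trade-off is that it will not repair the document for you the way a cleanup tool would.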

Andrei Taranchenko answered Sep 17 '22 15:09