Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.
Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.
HTML parsing involves two stages: tokenization and tree construction. The tokenizer splits the input into tokens such as start tags, end tags, and attribute names and values. The tree builder then consumes those tokens and builds up the document tree. If the document is well-formed, parsing is straightforward and fast; malformed input is where parsers differ most.
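As a rough illustration of those two stages, here is a minimal sketch using only the standard library. It assumes Python 3, where the module is html.parser (on Python 2 it is called HTMLParser). The TreeBuilder class, the VOID set and the parse() helper are hypothetical names made up for this example, and the recovery rule for bad nesting is a deliberately simple one, not what any real parser library does.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree from the tokenizer callbacks of HTMLParser."""

    # Elements that never have children (a small, incomplete list for the sketch).
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # A start-tag token: attach a new node to the innermost open element.
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # An end-tag token: close back to the nearest matching open element.
        # Stray end tags with no matching open element are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # A text token: keep it unless it is whitespace only.
        if data.strip():
            self.stack[-1]["children"].append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered trailing text
    return builder.root


# The <b> element is never closed; the </p> end tag closes it implicitly.
tree = parse('<p class="x">Hello <b>world</p>')
```

Here the unclosed <b> ends up as a child of <p> and is closed when </p> arrives. How gracefully a parser recovers from this kind of input is exactly what separates the libraries discussed in this thread.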
In the simplest sense, an HTML parser is a program that extracts the useful text from a document and leaves the HTML tags (such as <h1>, <span>, <p>) behind. For example, the input <h1>Geeks for Geeks</h1> produces the output Geeks for Geeks.
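The tag-stripping behaviour described above can be sketched in pure Python with the standard library. This assumes Python 3's html.parser module; TextExtractor and strip_tags() are hypothetical names for this example, not part of any library mentioned in the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and discards every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush text that follows the last tag
    return "".join(extractor.parts)


print(strip_tags("<h1>Geeks for Geeks</h1>"))  # prints: Geeks for Geeks
```

Because this only listens for text events and never tracks nesting, unclosed tags do not stop it from returning the text.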
From the BeautifulSoup documentation:
Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does
So it might help to use the earlier version, which is precisely what the author himself recommends:
You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6.
This is no longer a problem. lxml is now supported on App Engine's Python 2.7 runtime: https://developers.google.com/appengine/docs/python/tools/libraries27