
HTML parser for GAE

Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.

Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.

Asked by hoju, Jan 29 '10 11:01




2 Answers

From the BeautifulSoup documentation:

"Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does"

So it may help to use the earlier version, 3.0.8. That is exactly what the author himself recommends:

"You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6."

Answered by lprsd, Oct 03 '22 06:10


This is no longer a problem: lxml is now supported on App Engine's Python 2.7 runtime. See https://developers.google.com/appengine/docs/python/tools/libraries27
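
If the same code has to run both where lxml is available and where it is not, one option is to try lxml first and fall back to a pure-Python parser. The sketch below is only an illustration of that idea, not anything taken from the App Engine docs: `extract_text` and `_TextCollector` are made-up names, the fallback uses the standard library's `html.parser` rather than BeautifulSoup or libxml2dom, and it is written for Python 3 (under Python 2.7 the standard-library module was called `HTMLParser`).

```python
from html.parser import HTMLParser

try:
    import lxml.html  # C extension; only usable where the runtime provides it
    HAVE_LXML = True
except ImportError:
    HAVE_LXML = False


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes and ignores tag structure."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, balanced or not.
        self.chunks.append(data)


def extract_text(html):
    """Return the visible text of an HTML fragment, tolerating bad markup."""
    if HAVE_LXML:
        # lxml repairs unclosed or misnested tags while building the tree.
        return lxml.html.fromstring(html).text_content()
    # Pure-Python fallback: no tree is built, so unbalanced tags cannot break it.
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)


broken = "<p>Hello <b>world<p>unclosed tags"
print(extract_text(broken))
```

The fallback only recovers text. If you need a real tree from bad markup without lxml, you would still want one of the parsers discussed above.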

Answered by hoju, Oct 03 '22 06:10