HTML parser for GAE

Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.

Which pure-Python HTML parser have you found performs best? My priority is handling bad HTML well, not speed.

260

asked Jan 29 '10 11:01

hoju

2 Answers

From the BeautifulSoup documentation:

"Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does."

So it might help to use the earlier version, 3.0.8. That is precisely what the author himself recommends:

You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6.

129

answered Oct 03 '22 06:10

lprsd

This is no longer a problem. lxml is now supported on App Engine's Python 2.7 runtime: https://developers.google.com/appengine/docs/python/tools/libraries27
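For code that has to run both where lxml is installed and where it is not, one common pattern is to try the import and fall back to a pure-Python parser. The following is only a rough sketch under some assumptions: it targets Python 3 (where the standard-library parser lives in `html.parser`), and `extract_links` and `_LinkCollector` are hypothetical names made up for illustration, not part of lxml or App Engine.

```python
from html.parser import HTMLParser

try:
    import lxml.html  # C-backed; only importable where the platform provides it
except ImportError:
    lxml = None  # signal that the pure-Python fallback should be used


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, whether or not it is ever closed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the non-empty href values found in html, in document order."""
    if lxml is not None:
        hrefs = lxml.html.fromstring(html).xpath("//a/@href")
        return [str(h) for h in hrefs if h]
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Calling `extract_links('<p><a href="/x">x<a href=y.html>y')` exercises unclosed tags and an unquoted attribute value. The fallback only reports tags as they appear and does not build a repaired tree, so it is a much weaker tool than lxml for seriously broken markup.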

answered Oct 03 '22 06:10

hoju


Tags: python, html-parsing, google-app-engine, lxml
