What’s the most forgiving HTML parser in Python?

Question

I have some random HTML that I tried to parse with BeautifulSoup, but in most cases (>70%) it chokes. I tried BeautifulSoup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 upwards), but the results are almost the same.

I can recall several HTML parser options available in Python off the top of my head:

BeautifulSoup
lxml
pyquery

I intend to test all of these, but I wanted to know which one, in your tests, comes out as the most forgiving and will still try to parse bad HTML.

Björn Lindqvist · Accepted Answer

They all are. I have yet to come across any HTML page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse, you can always preprocess them using some regexps to keep lxml happy.
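As a rough illustration of the regexp preprocessing idea, here is a minimal sketch using only the standard library. The function name preprocess_html and both cleanup rules are hypothetical choices for this example, not something the answer prescribes: it strips ASCII control characters and a leading XML declaration, two things that are commonly reported to upset lxml when it is given a str. Which rules you actually need depends on the pages you are dealing with.

```python
import re

# Hypothetical cleanup rules; adjust them to whatever your pages actually contain.
# ASCII control characters other than tab, newline and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# A leading XML declaration such as <?xml version="1.0" encoding="utf-8"?>.
_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>", re.IGNORECASE)


def preprocess_html(markup: str) -> str:
    """Return markup with control characters and a leading XML declaration removed."""
    markup = _CONTROL_CHARS.sub("", markup)
    markup = _XML_DECLARATION.sub("", markup, count=1)
    return markup


cleaned = preprocess_html(
    '<?xml version="1.0" encoding="utf-8"?>\n<p>Hello\x00 world</p>'
)
print(cleaned.strip())
```

The cleaned string can then be handed to whichever parser you are using.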

lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken HTML. For extremely broken HTML, lxml also ships with lxml.html.soupparser, which interfaces with the BeautifulSoup library.

Some approaches to parsing broken HTML using lxml.html are described here: http://lxml.de/elementsoup.html
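To make the lxml.html and lxml.html.soupparser suggestion concrete, here is a small sketch, assuming lxml is installed (and BeautifulSoup too, if the fallback is ever reached). The helper name parse_lenient and the sample markup are made up for this example; treat it as one possible arrangement rather than the recommended one.

```python
import lxml.html
from lxml import etree


def parse_lenient(markup):
    """Parse markup with lxml.html; fall back to the BeautifulSoup-backed parser."""
    try:
        return lxml.html.fromstring(markup)
    except (etree.ParserError, etree.XMLSyntaxError):
        # Imported lazily because soupparser needs BeautifulSoup to be installed.
        from lxml.html import soupparser
        return soupparser.fromstring(markup)


# Deliberately sloppy markup: unclosed <p> tags and mis-nested <b>/<i>.
broken = "<div><p>Unclosed paragraph<p>Another <b>bold <i>nested</b> text</div>"
root = parse_lenient(broken)

for paragraph in root.iter("p"):
    print(paragraph.text_content())
```

With input like this the fallback is normally never reached, since lxml.html repairs the tree by itself; the except branch only matters for documents that lxml.html refuses outright.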

Tags: python, html-parsing, beautifulsoup, lxml, pyquery

Vaibhav Mishra
