BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

When using Beautiful Soup, what is the difference between the 'lxml', 'html.parser', and 'html5lib' parsers?

When would you use one over the other, and what are the benefits of each? Whenever I tried them they seemed interchangeable, but people here keep telling me I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts here about this, but they barely cover the use cases, if at all.

Example:

soup = BeautifulSoup(response.text, 'lxml')

asked Aug 03 '17 21:08

duc hathaway

2 Answers

From the summary table of advantages and disadvantages in the docs:

html.parser - BeautifulSoup(markup, "html.parser")
- Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)
- Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

lxml - BeautifulSoup(markup, "lxml")
- Advantages: Very fast, Lenient
- Disadvantages: External C dependency

html5lib - BeautifulSoup(markup, "html5lib")
- Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
- Disadvantages: Very slow, External Python dependency
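
To see these trade-offs concretely, here is a minimal sketch that feeds the same invalid fragment to each parser and collects the resulting tree. The helper name `compare_parsers` and the `BROKEN` sample are illustrative, not part of Beautiful Soup. It assumes `bs4` is installed and treats lxml and html5lib as optional, so a missing parser is reported instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: an unclosed <a> followed by a stray </p>.
BROKEN = "<a></p>"


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree}, or None for parsers not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are third-party packages and may be absent.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, tree in compare_parsers(BROKEN).items():
        print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

According to the Beautiful Soup documentation, for this input html.parser gives `<a></a>`, lxml gives `<html><body><a></a></body></html>`, and html5lib gives `<html><head></head><body><a><p></p></a></body></html>`. Exact output may vary with the installed versions, but the point stands: on valid HTML the parsers look interchangeable, and on broken HTML they can build different trees.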

answered Oct 07 '22 00:10

Vinícius Figueiredo

The key differences are highlighted in the BeautifulSoup documentation:

Differences between parsers

The basic reasons to prefer one parser over the others:

html.parser - built in, no extra dependencies needed
html5lib - the most lenient; the best choice if the HTML is broken
lxml - the fastest
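
One way to turn that reasoning into code is a small helper that tries parsers in order of preference and falls back to the built-in one. This is only a sketch: `make_soup` and its `lenient` flag are hypothetical names, not a Beautiful Soup API, and the preference order is an assumption based on the list above.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, lenient=False):
    """Parse markup with the first installed parser; return (soup, parser_name).

    lenient=True prefers html5lib (most forgiving); otherwise lxml (fastest)
    is tried first. html.parser ships with Python, so it is the final fallback.
    """
    if lenient:
        order = ("html5lib", "lxml", "html.parser")
    else:
        order = ("lxml", "html.parser")
    for name in order:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


if __name__ == "__main__":
    soup, used = make_soup("<p>Hello <b>world</b></p>")
    print(used, soup.b.get_text())
```

Returning the parser name alongside the soup makes it easy to log which parser was actually used. That matters because the same page can produce a different tree on a machine where lxml or html5lib is missing.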

answered Oct 06 '22 23:10

alecxe

Tags: python, html, beautifulsoup, web-scraping, lxml
