When using Beautiful Soup, what is the difference between "lxml", "html.parser", and "html5lib"?
When would you use one over the others, and what are the benefits of each? When I used each, they seemed interchangeable, but people here tell me I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts on here about this, but they don't go over the uses much, if at all.
Example:
soup = BeautifulSoup(response.text, 'lxml')
lxml is a similar parser, but it is driven by XML features rather than HTML. It depends on external C libraries and is faster than html5lib. The difference in behavior between the two parsers shows up when each is given the same sample tag.
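As a rough sketch of that comparison, the snippet below feeds one deliberately broken fragment to each parser and collects the serialized tree. The helper name parse_with_each and the sample markup "<a></p>" are illustrative choices, not part of the original post; a parser that is not installed is reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree}; None marks a parser that is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
    return results


if __name__ == "__main__":
    # "<a></p>" has an unclosed <a> and a stray </p>, so each parser must guess a repair.
    for name, tree in parse_with_each("<a></p>").items():
        print(f"{name:12} {tree}")
```

For this input the Beautiful Soup documentation reports that html.parser drops the stray closing tag and yields just <a></a>, lxml does the same but wraps the result in <html><body>, and html5lib keeps an empty <p></p> inside the <a> and also adds a <head>. Results may differ on other markup or other library versions.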
lxml can make use of BeautifulSoup as a parser backend, just as BeautifulSoup can employ lxml as a parser. When using BeautifulSoup from lxml, however, the default is to use Python's integrated HTML parser in the html.parser module.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
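A minimal sketch of that parse-tree extraction, using only the built-in html.parser. The markup string is made up for illustration and intentionally leaves a <p> and an <a> unclosed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> and the first <a> are never closed.
markup = '<p>First <a href="/one">one<p>Second <a href="/two">two</a>'

soup = BeautifulSoup(markup, "html.parser")

# find_all walks the parse tree in document order, including nested tags.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```

Both links are still found even though the tags are unbalanced. How the unclosed tags end up nested in the tree depends on the parser, so the extracted text of each tag may differ between parsers.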
From the docs' summarized table of advantages and disadvantages:
html.parser - BeautifulSoup(markup, "html.parser")
Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
lxml - BeautifulSoup(markup, "lxml")
Advantages: Very fast, Lenient
Disadvantages: External C dependency
html5lib - BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
Disadvantages: Very slow, External Python dependency
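One way to act on this table in code is to try parsers in order of preference and fall back to the built-in one. This is only a sketch: the function make_soup and its prefer_lenient flag are hypothetical names, and the ordering simply follows the speed and leniency notes above.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, prefer_lenient=False):
    """Parse markup with the first available parser; return (soup, parser_name)."""
    if prefer_lenient:
        # html5lib is the most lenient, so try it first for broken HTML.
        order = ("html5lib", "lxml", "html.parser")
    else:
        # lxml is the fastest; html.parser ships with Python.
        order = ("lxml", "html.parser")
    for name in order:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # optional parser not installed, try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hi</p>")
print(used, soup.p.get_text())
```

Because different parsers can build different trees from the same invalid markup, falling back silently can change results between machines; pinning a single parser explicitly is the safer choice when that matters.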
The key differences are highlighted in the BeautifulSoup documentation. The basic reasons to prefer one parser over the others:
html.parser
- built-in, no extra dependencies needed
html5lib
- the most lenient, better to use it if the HTML is broken
lxml
- the fastest