BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

When using Beautiful Soup, what is the difference between the 'lxml', 'html.parser', and 'html5lib' parsers?

When would you use one over the others, and what are the benefits of each? When I tried each of them they seemed interchangeable, but people here keep telling me I should be using a different one. I'd like to strengthen my understanding. I've read a couple of posts here about this, but they barely cover the use cases, if at all.

Example:

soup = BeautifulSoup(response.text, 'lxml')
duc hathaway asked Aug 03 '17 21:08




2 Answers

From the summary table of advantages and disadvantages in the docs:

  1. html.parser - BeautifulSoup(markup, "html.parser")

    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)

    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

  2. lxml - BeautifulSoup(markup, "lxml")

    • Advantages: Very fast, Lenient

    • Disadvantages: External C dependency

  3. html5lib - BeautifulSoup(markup, "html5lib")

    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

    • Disadvantages: Very slow, External Python dependency
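
The "lenient" entries above are easiest to see on invalid markup. The Beautiful Soup documentation illustrates this with the fragment `<a></p>`: html.parser ignores the stray closing tag, lxml drops it and wraps the result in html/body tags, and html5lib pairs it with an opening `<p>` and also adds an empty head. The following is a minimal sketch for reproducing that comparison locally. The helper name `compare_parsers` is hypothetical (not part of bs4), and it assumes lxml and html5lib may or may not be installed; exact output can vary between library versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The three parser names discussed above.
PARSERS = ("html.parser", "lxml", "html5lib")


def compare_parsers(markup):
    """Parse `markup` with each parser and return {name: serialized tree}.

    A parser that is not installed maps to None instead of raising.
    (Hypothetical helper for illustration, not part of Beautiful Soup.)
    """
    results = {}
    for name in PARSERS:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            results[name] = None
    return results


if __name__ == "__main__":
    # Invalid fragment taken from the Beautiful Soup docs.
    for name, tree in compare_parsers("<a></p>").items():
        print(f"{name:12} {tree if tree is not None else '(not installed)'}")
```

On well-formed HTML the three trees are usually equivalent, which is why the parsers can look interchangeable until the input is broken.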

Vinícius Figueiredo answered Oct 07 '22 00:10



The key differences are highlighted in the BeautifulSoup documentation:

  • Differences between parsers

The basic reasons to prefer one parser over the others:

  • html.parser - built in, no extra dependencies needed
  • html5lib - the most lenient; the better choice if the HTML is broken
  • lxml - the fastest
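
One way to act on these trade-offs is to ask for the preferred parser first and fall back to the built-in one when it is not installed. The sketch below is an assumption about how you might wire that up, not something from the docs: `make_soup` and its `preferred` parameter are hypothetical names, and the default order (lxml for speed, then html.parser) is just one reasonable choice.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Build a soup with the first parser in `preferred` that is installed.

    Hypothetical helper: the default order favors speed (lxml) and falls
    back to the built-in html.parser, which needs no extra dependency.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError(f"none of these parsers are available: {preferred}")


if __name__ == "__main__":
    # Fast path for ordinary pages.
    soup = make_soup("<p>Hello</p>")
    print(soup.p.get_text())

    # For badly broken HTML, put the most lenient parser first.
    messy = make_soup("<p>Hi", preferred=("html5lib", "lxml", "html.parser"))
    print(messy.p.get_text())
```

Keep in mind that a fallback can change the resulting tree on invalid markup, so for reproducible scraping it is usually safer to pin one parser and install it explicitly.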
alecxe answered Oct 06 '22 23:10
