BeautifulSoup different parsers

Question

Could anyone elaborate on the difference between parsers such as html.parser and html5lib? I've stumbled across some odd behavior: with html.parser, all the tags in one specific place are ignored. Look at this code:

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)

This returns an empty list, whereas with html5lib the desired "a" tags are returned as expected. Does anyone know the reason for that?
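One way to see where the tags go: Beautiful Soup's 'html.parser' builder sits on top of the standard library's html.parser.HTMLParser, so tracing that tokenizer's events on the same markup shows whether the "a" start tags are ever reported. This is only a diagnostic sketch. TagTracer, trace and count_links are made-up names, and the result depends on the Python version: older releases treat `<![endif]` as a declaration that runs up to the next `]>` (which here sits after the links), while newer releases may handle it differently.

```python
from html.parser import HTMLParser

# The markup from the question, unchanged.
markup = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""


class TagTracer(HTMLParser):
    """Hypothetical helper: records what the stdlib tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))

    def unknown_decl(self, data):
        # Called for <![...]> style declarations on versions that parse them.
        self.events.append(("unknown_decl", data.strip()))


def trace(text):
    """Feed the text to the tokenizer and return the recorded events."""
    tracer = TagTracer()
    tracer.feed(text)
    tracer.close()
    return tracer.events


def count_links(text):
    """Count how many <a> start tags the tokenizer reported."""
    return sum(1 for kind, value in trace(text) if kind == "start" and value == "a")


for event in trace(markup):
    print(event)
print("a tags seen by the tokenizer:", count_links(markup))
```

If the count printed at the end is 0, the links were swallowed before Beautiful Soup ever built a tree, which would match the empty list from find_all.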

I've read the documentation, but its explanation of the different parsers is pretty vague.

I've also noticed that html5lib drops invalid markup such as nested form tags. Is there a way to use html5lib to avoid the html.parser behavior above and still keep invalid tags like nested forms? (When parsing with html5lib, one of the form tags is removed.)
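To check the nested form behavior concretely, a small comparison like the one below may help. It is a sketch under assumptions: the nested markup and the count_forms helper are invented for illustration, lxml and html5lib are optional third-party packages (reported as None when missing), and no particular count is promised for any parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative invalid markup: a form nested inside another form.
nested = """
<form id="outer">
  <input name="a">
  <form id="inner">
    <input name="b">
  </form>
</form>
"""


def count_forms(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <form> tags found, or None if not installed}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("form"))
    return counts


for name, n in count_forms(nested).items():
    print(name, "not installed" if n is None else f"{n} form tag(s)")
```

html5lib follows the HTML5 parsing rules, which ignore a form start tag while another form is still open, so the dropped tag is expected there; I'm not aware of an option that turns this off.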

Thanks in advance.

KunduK · Accepted Answer

You can use lxml, which is very fast (it is a separate package, installed with pip install lxml), and then use either find_all or select to get all the tags.

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)

Or, using select with a CSS selector instead of find_all:

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
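Since lxml is not always installed, a fallback along these lines might be handy. It is only a sketch: make_soup and the preference order are assumptions of mine, not part of Beautiful Soup, and different parsers can build different trees from the same broken markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not available; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup('<p><a href="test"></a><a href="test"></a></p>')
print(used, len(soup.find_all("a")))
```

Printing the parser name makes it easier to tell which one produced a given result when the output differs between machines.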

Tags:

python-3.x

beautifulsoup

IlanL
