
BeautifulSoup different parsers

Could anyone elaborate on the difference between parsers like html.parser and html5lib? I've stumbled across some weird behavior where html.parser ignores all the tags in a specific place. Look at this code:

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)

This returns an empty list, whereas with html5lib the desired "a" tags are returned as expected. Does anyone know the reason for that?

I've read the documentation, but its explanation of the different parsers is pretty vague.

Also, I've noticed that html5lib drops invalid markup such as nested form tags (when parsing with html5lib, one of the form tags is removed). Is there a way to use html5lib to avoid the html.parser behavior above and still keep invalid tags like nested form tags?

Thanks in advance.

IlanL asked Nov 06 '22 17:11


1 Answer

You can use the lxml parser, which is very fast. It is a third-party package, so it has to be installed separately. With it, either find_all or select gets all the "a" tags.

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)

Or, with select instead of find_all:

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
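
If switching parsers is not an option, one possible workaround is to strip the IE conditional comments before parsing. As far as I can tell, on the Python versions where the empty list reproduces, the standard-library tokenizer behind 'html.parser' treats the stray <![endif] as a marked section and scans ahead for the next "]>", which in this document only appears in the later <!--[if lte IE 8]>, so everything in between (including the "a" tags) is swallowed. Newer Python releases appear to handle this differently, so the raw count below may vary by version. The sketch uses only the standard library; the names strip_conditional_comments, TagLogger and count_tags are hypothetical helpers, not part of BeautifulSoup, and the regexes assume the conditional comments look like the ones in the question.

```python
import re
from html.parser import HTMLParser

# Hypothetical helpers for illustration; not part of bs4 or the stdlib.
# Assumption: conditional comments have the form <!--[if ...]> ... <![endif]-->
CONDITIONAL_BLOCK = re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL)
STRAY_ENDIF = re.compile(r"<!\[endif\]-->")


def strip_conditional_comments(markup):
    """Remove whole IE conditional blocks, then any leftover <![endif]-->."""
    markup = CONDITIONAL_BLOCK.sub("", markup)
    return STRAY_ENDIF.sub("", markup)


class TagLogger(HTMLParser):
    """Record start tags and declarations the tokenizer cannot classify."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.unknown_decls = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def unknown_decl(self, data):
        # On affected versions this receives the text that was swallowed.
        self.unknown_decls.append(data)


def count_tags(markup, name):
    """Count how many start tags called `name` the stdlib tokenizer reports."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.start_tags.count(name)


html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

cleaned = strip_conditional_comments(html)
print("raw:", count_tags(html, "a"))         # version-dependent
print("cleaned:", count_tags(cleaned, "a"))  # no conditional comments left
```

The cleaned string can then be passed to BeautifulSoup(cleaned, 'html.parser') as usual. Note that this discards anything inside the IE-only blocks, which is only acceptable if that content is not needed.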
KunduK answered Nov 11 '22 18:11