Could anyone elaborate on the difference between parsers like html.parser and html5lib? I've stumbled across some weird behavior where html.parser ignores all the tags in a specific place. Look at this code:
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)
This returns an empty list, whereas with html5lib the desired "a" tags are returned as expected. Does anyone know the reason for that?
I've read the documentation, but the explanation of the different parsers is pretty vague.
Also, I've noticed that html5lib drops invalid tags such as nested form tags (when parsing with html5lib, one of the form tags is removed). Is there a way to avoid the html.parser behavior shown above and still keep invalid tags like nested form tags?
Thanks in advance.
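One way to see where the "a" tags go is to log the events that the standard-library tokenizer behind html.parser emits for this markup. The sketch below is only a diagnostic aid: EventLogger is a hypothetical helper name, and the result depends on the Python version in use. On versions where find_all returns an empty list, the output may show a single unknown_decl event whose text contains the four anchors, which would suggest that the stray <![endif]--> was read as a declaration running up to the next "]>".

```python
from html.parser import HTMLParser

HTML = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""


class EventLogger(HTMLParser):
    """Hypothetical helper: records the events the tokenizer emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        # Called for "<![...]>" style declarations the parser does not understand.
        self.events.append(("unknown_decl", data))


logger = EventLogger()
logger.feed(HTML)
logger.close()

# Print every recorded event so swallowed markup becomes visible.
for kind, payload in logger.events:
    print(kind, repr(payload))

# Count how many <a> start tags the tokenizer actually reported.
a_count = sum(1 for kind, payload in logger.events
              if kind == "start" and payload == "a")
print("a tags seen:", a_count)
```

If the anchors show up inside a comment or unknown_decl payload rather than as start events, the tokenizer never handed them to BeautifulSoup as tags, which would explain the empty list.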
You can use lxml, which is very fast, and then use find_all or select to get all the tags.
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)
Or, with select instead of find_all:
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
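If switching parsers is not an option, another possibility is to strip the IE conditional-comment markup before parsing. The following is a minimal sketch under stated assumptions, not a general HTML sanitizer: strip_conditional_comments and the two regular expressions are hypothetical names introduced here, and the approach assumes the conditional blocks (and whatever they wrap for old IE versions) can safely be discarded.

```python
import re

HTML = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<a href="test"></a>
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""

# A complete conditional block: <!--[if ...]> ... <![endif]-->
_CONDITIONAL_BLOCK = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE
)
# An opener or closer left over without its partner.
_STRAY_MARKER = re.compile(r"<!--\[if[^\]]*\]>|<!\[endif\]-->", re.IGNORECASE)


def strip_conditional_comments(markup):
    """Remove IE conditional blocks, then any unmatched markers (assumed disposable)."""
    markup = _CONDITIONAL_BLOCK.sub("", markup)
    return _STRAY_MARKER.sub("", markup)


cleaned = strip_conditional_comments(HTML)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup(cleaned, 'html.parser'). With no conditional-comment markup left, find_all('a') should return the four links. Since html.parser does not restructure the document the way html5lib does, this may also keep nested form tags, though that is worth checking against the real page.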