When using BeautifulSoup with html5lib, it adds the html, head and body tags automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there any option I can set to turn off this behavior?
Omitting the html, head, and body tags is certainly allowed by the HTML specifications. The underlying reason is that browsers have always sought to be consistent with existing web pages, and the very early versions of HTML didn't define those elements.
BeautifulSoup 3 was not a real HTML parser; it used regular expressions to dive through tag soup, which made it more forgiving in some cases and worse in others. (Beautiful Soup 4 instead delegates parsing to a backend such as html.parser, lxml, or html5lib.) It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection.
The <head> tag in HTML defines the head portion of the document, which contains information about the document. The <head> tag contains other head elements such as <title>, <meta>, <link>, and <style>. In HTML 4.01 the <head> element was mandatory, but in HTML5 the <head> element can be omitted.
In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>
This parses the HTML with Python's built-in HTML parser. Quoting the docs:
"Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag."
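A minimal sketch of the html.parser approach, assuming only that bs4 is installed. The helper name parse_fragment is made up for illustration and is not part of bs4:

```python
from bs4 import BeautifulSoup


def parse_fragment(markup):
    """Parse an HTML fragment without adding <html>, <head>, or <body>.

    parse_fragment is a hypothetical helper name, not a bs4 API.
    """
    return BeautifulSoup(markup, "html.parser")


soup = parse_fragment("<h1>FOO</h1>")
print(soup)       # <h1>FOO</h1>
print(soup.html)  # None, because html.parser added no wrapper element
```

Keep in mind that html.parser does less repair work on broken markup than html5lib does, so this trades browser-like error recovery for an unwrapped tree.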
Alternatively, you could use the html5lib parser and just select the element after <body>:
In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
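If the fragment has more than one top-level element, taking only the element after <body> drops the rest. One possible sketch is to serialize everything inside <body> instead. The helper name body_fragment and its parser argument are assumptions for illustration; decode_contents() is the bs4 method that renders a tag's children as a string:

```python
from bs4 import BeautifulSoup


def body_fragment(markup, parser="html5lib"):
    """Return the markup found inside <body>, without the wrapper tags.

    body_fragment is a hypothetical helper, not a bs4 API. The default
    parser requires the html5lib package to be installed.
    """
    soup = BeautifulSoup(markup, parser)
    # Parsers such as html.parser may not create a <body>; fall back to the soup.
    root = soup.body if soup.body is not None else soup
    return root.decode_contents()


# Demonstrated with html.parser on a complete document so it runs without html5lib.
doc = "<html><head></head><body><h1>FOO</h1><p>bar</p></body></html>"
print(body_fragment(doc, "html.parser"))  # <h1>FOO</h1><p>bar</p>
```

With html5lib installed, calling body_fragment on a bare fragment such as '<h1>FOO</h1><p>bar</p>' should give back both elements, since html5lib places them inside the <body> it creates.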
Let's first create a soup sample:
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")
You can get a child of html or body by name with soup.body.<tag>:
# python3: get body's first child
print(next(soup.body.children))
# if the first child's tag is rss
print(soup.body.rss)
You could also use unwrap() to remove body, head, and html:
soup.html.body.unwrap()
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
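The unwrap() steps can be folded into one small function. This is only a sketch: strip_wrappers is a made-up name, and it assumes the soup holds at most one html, head, and body element:

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap <body>, <head>, and <html> in place, keeping their children.

    strip_wrappers is a hypothetical helper, not a bs4 API.
    """
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:  # skip wrappers the parser did not create
            tag.unwrap()
    return soup


doc = "<html><head></head><body><p>content</p></body></html>"
soup = strip_wrappers(BeautifulSoup(doc, "html.parser"))
print(soup)  # <p>content</p>
```

Note that unwrap() keeps a tag's children, so anything inside head (a title, meta tags) ends up at the top level of the soup. If you want the head contents gone, decompose() on the head tag is probably the better fit.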
If you load an XML file, bs4.diagnose.diagnose(data) will tell you to use lxml-xml, which will not wrap your soup in html and body tags:
>>> from bs4 import BeautifulSoup as BS
>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>