
Don't put html, head and body tags automatically, beautifulsoup

When I use BeautifulSoup with html5lib, it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html> 

Is there an option I can set to turn off this behavior?

Bengineer asked Feb 11 '13 22:02




2 Answers

In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

This parses the HTML with Python's built-in HTML parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
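As a quick check of the quoted behaviour, the sketch below parses the same fragment with html.parser and confirms that no wrapper tags appear. The variable names are illustrative, and only bs4 itself is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Parse a bare fragment with Python's built-in parser (no extra install needed).
soup = BeautifulSoup('<h1>FOO</h1>', 'html.parser')

# No <html>, <head> or <body> wrapper was added, so these lookups find nothing.
has_wrappers = any(soup.find(name) is not None for name in ('html', 'head', 'body'))

print(str(soup))      # the fragment, serialized back as parsed
print(has_wrappers)
```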


Alternatively, you could use the html5lib parser and just select the element after <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
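If the fragment has more than one top-level element, taking a single element from body is not enough. One possible way to get everything inside body as markup, whichever parser is used, is sketched below. fragment_html is a hypothetical helper name, not part of bs4, and the loop simply skips parsers that are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def fragment_html(markup, parser='html5lib'):
    """Return the parsed markup without any <html>/<head>/<body> wrapper.

    fragment_html is a hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)
    body = soup.body
    if body is not None:
        # html5lib and lxml add a <body>; serialize only what is inside it.
        return body.decode_contents()
    # html.parser adds no wrapper, so the whole soup is already the fragment.
    return soup.decode()


for parser in ('html5lib', 'lxml', 'html.parser'):
    try:
        print(parser, fragment_html('<h1>FOO</h1><p>bar</p>', parser))
    except FeatureNotFound:
        # The parser library is not installed; skip it.
        print(parser, 'not installed')
```

Note that this keeps only the contents of body, so anything the parser placed in head is dropped.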
unutbu answered Oct 09 '22 11:10


Let's first create a soup sample:

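from bs4 import BeautifulSoup

# html5lib is named explicitly so that soup.html, soup.head and soup.body all exist
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")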

You can reach the children of html and body by attribute access, for example soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)

You can also use unwrap() to remove the body, head, and html tags:

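soup.html.body.unwrap()
# Unwrap <head> only if it exists. Plain attribute access is used here because
# a selector that starts with a combinator ('> head') may be rejected by the
# CSS selector backend in newer bs4 installs.
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()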
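The unwrap() approach fails with an AttributeError when a wrapper tag is missing, for instance when the soup came from html.parser. A minimal sketch of a more defensive version follows. strip_wrappers is a hypothetical helper name, not part of bs4, and html.parser is used only so the example runs without extra packages.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap <body>, <head> and <html> in place, skipping any that are absent.

    strip_wrappers is a hypothetical helper, not part of bs4.
    """
    for name in ('body', 'head', 'html'):
        tag = soup.find(name)
        if tag is not None:
            # unwrap() removes the tag itself but keeps its children in place.
            tag.unwrap()
    return soup


# html.parser keeps whatever wrappers the input contains and adds none.
doc = BeautifulSoup('<html><head></head><body><p>content</p></body></html>', 'html.parser')
print(strip_wrappers(doc))

# A bare fragment has no wrappers, so the helper leaves it untouched.
fragment = BeautifulSoup('<p>content</p>', 'html.parser')
print(strip_wrappers(fragment))
```

As with the original snippet, unwrapping head moves any of its children (title, meta and so on) into the top level of the soup rather than deleting them.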

If you are loading an XML file, running diagnose(data) (imported with from bs4.diagnose import diagnose) shows how each installed parser handles it. The lxml-xml parser does not wrap your soup in html and body:

>>> from bs4 import BeautifulSoup as BS
>>> BS('<foo>xxx</foo>', 'lxml-xml')
<?xml version="1.0" encoding="utf-8"?>
<foo>xxx</foo>
ahuigo answered Oct 09 '22 13:10