Don't put html, head and body tags automatically, beautifulsoup

Tags:

python

beautifulsoup

html5lib

When I use BeautifulSoup with html5lib, it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option I can set to turn off this behavior?

642

asked Feb 11 '13 22:02

Bengineer

2 Answers

In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

This parses the HTML with Python's builtin HTML parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
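As a minimal sketch of the behavior described in that quote (the helper name parse_fragment is hypothetical, introduced only for illustration), parsing with html.parser leaves the fragment unwrapped, so soup.html and soup.body are simply absent:

```python
from bs4 import BeautifulSoup

def parse_fragment(markup):
    """Parse markup with Python's built-in parser, which adds no wrapper tags."""
    return BeautifulSoup(markup, "html.parser")

fragment = parse_fragment("<h1>FOO</h1>")
print(fragment)       # the fragment as given, without html/head/body
print(fragment.html)  # None, because no <html> tag was created
```

Keep in mind that html.parser is more lenient than html5lib with malformed markup, so the resulting tree may differ on broken input.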

Alternatively, you could use the html5lib parser and just select the element after <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
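Note that soup.body.next returns only the first element inside body. If the fragment has several top-level elements, one possible approach is to render everything inside body with decode_contents(). The helper name body_inner_html is hypothetical, and this sketch passes explicit wrapper tags to html.parser so that it runs without html5lib installed; with html5lib the wrappers would be added for you.

```python
from bs4 import BeautifulSoup

def body_inner_html(soup):
    """Return the markup inside <body>, or the whole soup if there is no <body>."""
    if soup.body is not None:
        return soup.body.decode_contents()
    return soup.decode_contents()

# Explicit wrappers stand in for what html5lib would add automatically.
wrapped = BeautifulSoup(
    "<html><head></head><body><h1>FOO</h1><p>bar</p></body></html>",
    "html.parser",
)
inner = body_inner_html(wrapped)
print(inner)
```

The result is a string, so parse it again if you need a tree rather than markup.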

173

answered Oct 09 '22 11:10

unutbu

Let's first create a soup sample:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")

You can get a child of body by specifying soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)

You can also use unwrap() to remove the body, head, and html tags while keeping their contents:

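soup.html.body.unwrap()
# look for a direct-child head only; recent bs4 versions reject a selector that starts with '>'
head = soup.html.find('head', recursive=False)
if head is not None:
    head.unwrap()
soup.html.unwrap()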
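The unwrap steps above could be collected into a small helper. This is only a sketch: the function name strip_wrappers is hypothetical, and the sample uses html.parser with explicit wrapper tags so that it runs without html5lib installed.

```python
from bs4 import BeautifulSoup

def strip_wrappers(soup):
    """Unwrap body, head and html (if present), keeping their children in place."""
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return soup

# Explicit wrappers stand in for what html5lib would add automatically.
doc = BeautifulSoup(
    "<html><head></head><body><p>content</p></body></html>",
    "html.parser",
)
stripped = strip_wrappers(doc)
print(stripped)
```

The helper modifies the soup in place and does nothing when the wrapper tags are absent, so it should be safe to call regardless of which parser built the tree.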

If you are loading an XML file, bs4.diagnose.diagnose(data) will tell you to use the lxml-xml parser, which does not wrap your soup in html and body tags:

>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>

answered Oct 09 '22 13:10

ahuigo
