How to prevent BeautifulSoup4 from adding extra <html><body> tags to the soup? [duplicate]

Tags: python, beautifulsoup

In BeautifulSoup versions prior to 4 I could take any chunk of HTML and get a string representation this way:

from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
    '<div><b>soup 3</b></div>'

However with BeautifulSoup4 the same operation creates additional tags:

from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
    '<html><body><div><b>soup 4</b></div></body></html>'
     ^^^^^^^^^^^^                        ^^^^^^^^^^^^^^

I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is not nearly as good as the lxml or html5lib parsers that are available with BS4.

asked Apr 12 '13 21:04

ccpizza

2 Answers

If you want your code to work on everyone's machine, no matter which parser(s) they have installed, etc. (the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.

If you know you have a fragment, something like this will give you exactly that fragment:

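def first_fragment_node(html):
    # Hypothetical helper: return the first node of the fragment,
    # skipping any <html>/<body> wrappers the parser added.
    soup4 = BeautifulSoup(html)
    if soup4.body:
        return soup4.body.next_element
    elif soup4.html:
        return soup4.html.next_element
    else:
        return soup4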
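
If you want the fragment back as a string rather than as a node, a minimal sketch along the same lines follows. It assumes Python 3 and bs4; the function name `fragment_to_string` and the default parser argument are made up for illustration. It serializes whatever sits inside the innermost wrapper the parser added, so a fragment with several top-level nodes is kept whole:

```python
from bs4 import BeautifulSoup


def fragment_to_string(html, parser="html.parser"):
    """Parse an HTML fragment and serialize it without parser-added wrappers.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, parser)
    # Prefer <body>, then <html>, depending on what the chosen parser
    # decided to wrap around the fragment.
    for root in (soup.body, soup.html):
        if root is not None:
            return root.decode_contents()
    # No wrapper was added: serialize the soup's own contents.
    return soup.decode_contents()


print(fragment_to_string('<div><b>soup 4</b></div>'))
# prints: <div><b>soup 4</b></div>
```

This only makes sense when the input really is a fragment; if the input has its own <body>, the function returns that body's contents.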

Of course, if you know your fragment is a single div, it's even easier, although it's harder to think of a use case where you'd know that:

soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
div = soup4.div  # the first <div> in the tree, with or without parser-added wrappers

If you want to know why this happens:

BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.

As the "Differences between parsers" section of the Beautiful Soup documentation says:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

So, while this exact difference isn't documented, it's just a special case of something that is.
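
To see the difference on your own machine, here is a rough sketch (Python 3; `render_with_each_parser` is a made-up name) that feeds the same fragment to each parser and records how it serializes. Parsers that are not installed are reported as `None` instead of raising:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render_with_each_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized soup, or None if the parser is missing}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(html, name))
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            results[name] = None
    return results


for name, text in render_with_each_parser('<div><b>soup 4</b></div>').items():
    print(name, '->', text)
```

What each installed parser prints depends on its version, which is the point of the answer above.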

130

answered Oct 14 '22 05:10

abarnert

As was noted in the old BeautifulStoneSoup documentation:

The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.

Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...

And in the BeautifulSoup4 docs:

There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.

Perhaps that will yield what you want.
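
A small sketch of that suggestion, assuming Python 3 and that lxml is installed (the "xml" feature in BS4 depends on it). The helper name `parse_as_xml` is hypothetical, and it returns `None` when no XML parser is available:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_as_xml(markup):
    """Parse markup with BS4's XML mode; return None if no XML parser is installed."""
    try:
        return BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        return None


soup = parse_as_xml('<div><b>soup 4</b></div>')
if soup is not None:
    # Serialize just the element, not the whole document object.
    print(soup.div)
```

Keep in mind that XML mode applies XML rules rather than HTML heuristics, so it may not suit real-world HTML that is not well-formed.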

answered Oct 14 '22 06:10

msw
