
How to prevent BeautifulSoup4 from adding extra <html><body> tags to the soup? [duplicate]

In BeautifulSoup 3 and earlier I could take any chunk of HTML and get a string representation in this way:

from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
    '<div><b>soup 3</b></div>'

However with BeautifulSoup4 the same operation creates additional tags:

from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
    '<html><body><div><b>soup 4</b></div></body></html>'
     ^^^^^^^^^^^^                        ^^^^^^^^^^^^^^ 

I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is not nearly as good as the lxml or html5lib parsers that are available with BS4.

ccpizza asked Apr 12 '13 21:04




2 Answers

If you want your code to work on everyone's machine, no matter which parser(s) they have installed, you pretty much need to handle all of the legitimate results. Results vary more than you might expect: the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between Python 2.7.2 and 2.7.3, and so on.

If you know you have a fragment, something like this will give you back the first element of that fragment, whichever parser ran:

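from bs4 import BeautifulSoup

def fragment_root(markup):
    # No parser is named on purpose: bs4 picks the best one installed.
    soup4 = BeautifulSoup(markup)
    if soup4.body:    # the parser wrapped the fragment in <html><body>
        return soup4.body.next_element
    elif soup4.html:  # the parser added <html> but no <body>
        return soup4.html.next_element
    else:             # the parser added no wrapper at all
        return soup4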

Of course, if you know your fragment is a single div, it's even easier, although it's harder to think of a use case where you'd know that:

soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
div = soup4.div  # finds the div whether or not the parser added <html><body>
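
Since the question asks for a string rather than a node, the same fallback idea can be sketched so that it returns markup. This is only a sketch: the name fragment_markup and its parser default are made up for illustration and are not part of bs4, and it assumes any wrapper the parser adds is <html> and/or <body>. If a parser adds <html> with a <head> but no <body>, the <head> would still appear in the result.

```python
from bs4 import BeautifulSoup


def fragment_markup(markup, parser="html.parser"):
    """Return the markup of a parsed fragment without an <html>/<body> wrapper.

    `fragment_markup` and its `parser` default are illustrative, not bs4 API.
    """
    soup = BeautifulSoup(markup, parser)
    # Use the innermost wrapper the parser added, or the soup itself if none.
    root = soup.body or soup.html or soup
    # decode_contents() renders the children of `root`, not `root` itself.
    return root.decode_contents()


print(fragment_markup('<div><b>soup 4</b></div>'))
print(fragment_markup('<p>one</p><p>two</p>'))
```

Unlike returning the first node, this keeps every top-level element of the fragment, so it also covers fragments that are not a single div.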

If you want to know why this happens:

BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.

As the "Differences between parsers" section of the BS4 docs says:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

So, while this exact difference isn't documented, it's just a special case of something that is.
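
One way to see the parser differences on your own machine is to feed the same fragment to each parser that happens to be installed. A rough sketch follows; render_with_each_parser is a hypothetical helper name, and the list of parser names is simply the three that the BS4 docs describe. Parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

FRAGMENT = '<div><b>soup 4</b></div>'


def render_with_each_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the string bs4 produces with it."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            continue
    return results


for name, rendered in render_with_each_parser(FRAGMENT).items():
    print(f"{name:12} {rendered}")
```

Passing the parser name explicitly, instead of letting bs4 pick one, is also the simplest way to make the output the same on every machine that has that parser.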

abarnert answered Oct 14 '22 05:10


As was noted in the old BeautifulStoneSoup documentation:

The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.

Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...

And in the BeautifulSoup4 docs:

There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.

Perhaps that will yield what you want.
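
A minimal sketch of that suggestion, assuming lxml is installed (the "xml" feature depends on it). parse_as_xml is a hypothetical helper name. The top-level nodes are rendered one by one so that any XML declaration that str(soup) might prepend is left out.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_as_xml(markup):
    """Parse `markup` with bs4's "xml" feature; return None if it is unavailable."""
    try:
        soup = BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        # Raised when no XML-capable parser (lxml) is installed.
        return None
    # Render each top-level node separately instead of calling str(soup).
    return "".join(str(node) for node in soup.contents)


print(parse_as_xml('<div><b>soup 4</b></div>'))
```

Keep in mind that this treats the input as XML, so HTML-specific leniency (unclosed tags, HTML entities and the like) is not applied; it is only a fit when the fragment is well-formed.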

msw answered Oct 14 '22 06:10