BeautifulSoup - how should I obtain the body contents

I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. However, BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There should be a better, more obvious way to do it.

asked Jan 30 '14 09:01

Philipp Zedler

2 Answers

Do you mean getting everything in between the body tags?

In that case, you can use:

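from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

page = urlopen('some_site').read()  # 'some_site' is a placeholder URL
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
body = soup.find('body')
# Direct child tags of <body>; bare text nodes are not included
the_contents_of_body_without_body_tags = body.find_all(recursive=False)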
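
The call above returns only the direct child tags of body, so any text sitting directly inside body is left out. Here is a minimal sketch of two alternatives, assuming a small hard-coded page (the `page` string and variable names are made up for illustration) and the built-in `html.parser`: `body.contents` keeps text nodes too, and `body.decode_contents()` gives the inner HTML as a single string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the downloaded HTML.
page = (
    "<html><head><title>t</title></head>"
    "<body>Intro text <p>Some paragraph</p><p>Another</p></body></html>"
)

soup = BeautifulSoup(page, "html.parser")
body = soup.find("body")

# Only direct child tags; the bare "Intro text " string is skipped.
child_tags = body.find_all(recursive=False)

# Every direct child, including text nodes.
all_children = list(body.contents)

# Inner HTML of <body> as one string, without the <body> tags themselves.
inner_html = body.decode_contents()
```

If you need a string at the end rather than a list of nodes, `decode_contents()` is probably the shortest route, and it leaves the parsed tree untouched.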

answered Oct 10 '22 05:10

Azwr

I've found the easiest way to get just the contents of the body is to unwrap() your contents from inside the body tags.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Hello World</p>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup)
<html><head></head><body><p>Hello World</p></body></html>
>>>
>>> soup.html.unwrap()
<html></html>
>>>
>>> print(soup)
<head></head><body><p>Hello World</p></body>
>>>
>>> soup.head.unwrap()
<head></head>
>>>
>>> print(soup)
<body><p>Hello World</p></body>
>>>
>>> soup.body.unwrap()
<body></body>
>>>
>>> print(soup)
<p>Hello World</p>

To make this reusable, you could put the names of those unwanted elements in a list and loop through them:

>>> def get_body_contents(html):
...     soup = BeautifulSoup(html, "html5lib")
...     for name in ['head', 'html', 'body']:
...         tag = soup.find(name)  # None if the tag is absent
...         if tag is not None:    # hasattr() is true even when it is missing
...             tag.unwrap()
...     return soup
>>>
>>> html = "<p>Hello World</p>"
>>> print(get_body_contents(html))
<p>Hello World</p>
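
The unwrapping is needed here because html5lib always builds a complete document around the input. As a hedged aside, and assuming the built-in `html.parser` copes with your markup, parsing a fragment with it does not add html, head, or body in the first place, so there may be nothing to unwrap. The variable names below are only for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>Hello World</p>"

# html.parser keeps the fragment as given instead of wrapping it
# in <html>, <head> and <body> like html5lib does.
soup = BeautifulSoup(html, "html.parser")

fragment = str(soup)
has_body = soup.body is not None
```

The trade-off is that `html.parser` is less forgiving with badly broken markup than html5lib, so this only helps when the input is reasonably clean.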

answered Oct 10 '22 05:10

Jeremy
