I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. However, BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:
>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n Some paragraph\n </p>'
This solution is a hack. There should be a better, more obvious way to do it.
BeautifulSoup has a built-in method for pulling the text out of an element: get_text(). To use it, call the method on any Tag or BeautifulSoup object. On older bs4 releases, get_text() is not available on NavigableString, because that object already is a string. Note that get_text() returns only the text, so any markup inside the body is dropped.
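As a minimal sketch of what get_text() returns, assuming the built-in "html.parser" backend and a made-up HTML string (neither comes from the thread):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = "<body><p>Some <b>bold</b> paragraph</p><p>Second</p></body>"
soup = BeautifulSoup(html, "html.parser")

# get_text() on a Tag concatenates every text node beneath it
body_text = soup.body.get_text()

# separator and strip are optional keyword arguments of get_text()
joined = soup.body.get_text(separator=" ", strip=True)

print(repr(body_text))
print(repr(joined))
```

This gives you the text only. If you need the markup inside body as well, get_text() is not the right tool.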
For web scraping to work in Python, we're going to perform three basic steps: (1) extract the HTML content using the requests library; (2) analyze the HTML structure and identify the tags that hold our content; (3) extract those tags using Beautiful Soup and put the data in a Python list.
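The three steps could be sketched roughly as follows. To keep the example runnable offline, step 1 is replaced by a hard-coded string; in real use the HTML would come from something like requests.get(url).text. The helper name extract_items, the sample markup, and the choice of li as the target tag are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def extract_items(html, tag_name="li"):
    """Return the stripped text of every <tag_name> element as a list."""
    soup = BeautifulSoup(html, "html.parser")
    # Steps 2 and 3: locate the tags that hold the content, collect their text
    return [tag.get_text(strip=True) for tag in soup.find_all(tag_name)]

# Step 1 stand-in: a hypothetical page body instead of a live HTTP request
sample_html = "<ul><li>alpha</li><li> beta </li></ul>"
items = extract_items(sample_html)
print(items)
```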
Do you mean getting everything in between the body tags?
In this case you can use:
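from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup
page = urlopen('some_site').read()  # 'some_site' is a placeholder URL
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
body = soup.find('body')
# Note: findChildren(recursive=False) returns only the direct child tags;
# bare text nodes directly inside body are not included.
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)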
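If what you want is the inner HTML of body as a string, text nodes included, one option is Tag.decode_contents(). The following is only a sketch: the helper name body_inner_html is made up, and it assumes the "html.parser" backend, which does not add html or body tags to a fragment (html5lib and lxml do), so the function falls back to the whole document when there is no body.

```python
from bs4 import BeautifulSoup

def body_inner_html(html):
    """Return the markup between <body> and </body> as a string."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        # html.parser leaves fragments unwrapped, so there is nothing to strip
        return soup.decode_contents()
    # decode_contents() renders the children of a tag, not the tag itself
    return body.decode_contents()

full_page = "<html><head><title>t</title></head><body><p>Hello <b>World</b></p>tail</body></html>"
print(body_inner_html(full_page))
print(body_inner_html("<p>Hello World</p>"))
```

Unlike unwrap(), this does not modify the parsed tree, and nothing from head leaks into the result.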
I've found the easiest way to get just the contents of the body is to unwrap() your contents from inside the body tags.
>>> from bs4 import BeautifulSoup
>>> html = "<p>Hello World</p>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup)
<html><head></head><body><p>Hello World</p></body></html>
>>>
>>> soup.html.unwrap()
<html></html>
>>>
>>> print(soup)
<head></head><body><p>Hello World</p></body>
>>>
>>> soup.head.unwrap()
<head></head>
>>>
>>> print(soup)
<body><p>Hello World</p></body>
>>>
>>> soup.body.unwrap()
<body></body>
>>>
>>> print(soup)
<p>Hello World</p>
To be more efficient and reusable you could put those undesirable elements in a list and loop through them...
>>> def get_body_contents(html):
...     soup = BeautifulSoup(html, "html5lib")
...     # Unwrap each wrapper tag if it is present in the parsed tree
...     for name in ['head', 'html', 'body']:
...         tag = soup.find(name)
...         if tag is not None:
...             tag.unwrap()
...     return soup
>>>
>>> html = "<p>Hello World</p>"
>>> print(get_body_contents(html))
<p>Hello World</p>