BeautifulSoup - how should I obtain the body contents

I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. However, BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There should be a better, more obvious way to do it.

Philipp Zedler asked Jan 30 '14 09:01



2 Answers

Do you mean getting everything in between the body tags?

In this case you can use:

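from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

page = urlopen('some_site').read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
body = soup.find('body')
# Direct child tags of <body>; bare text directly inside <body> is not included
the_contents_of_body_without_body_tags = body.find_all(recursive=False)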
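
To make the result of that call more concrete, here is a minimal sketch. It assumes a small hard-coded document (the `page` string is made up for illustration) instead of a downloaded site, and uses the built-in `html.parser`. It shows that the direct-children lookup returns tags only, while `body.contents` also keeps text nodes, and that the children can be joined back into a string without the body tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for the downloaded page.
page = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Some paragraph</p>loose text<div>More</div></body></html>"
)

soup = BeautifulSoup(page, "html.parser")
body = soup.find("body")

# Direct child tags only; the bare string "loose text" is skipped.
child_tags = body.find_all(recursive=False)

# .contents also includes text nodes that sit directly inside <body>.
all_children = body.contents

# Join the children back into one markup string without the <body> tags.
inner_html = "".join(str(child) for child in all_children)
print(inner_html)
```

If a single string is all that is needed, `body.decode_contents()` should give the same markup for this input.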
Azwr answered Oct 10 '22 05:10


I've found the easiest way to get just the contents of the body is to unwrap() your contents from inside the body tags.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Hello World</p>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup)
<html><head></head><body><p>Hello World</p></body></html>
>>>
>>> soup.html.unwrap()
<html></html>
>>>
>>> print(soup)
<head></head><body><p>Hello World</p></body>
>>>
>>> soup.head.unwrap()
<head></head>
>>>
>>> print(soup)
<body><p>Hello World</p></body>
>>>
>>> soup.body.unwrap()
<body></body>
>>>
>>> print(soup)
<p>Hello World</p>

To be more efficient and reusable you could put those undesirable elements in a list and loop through them...

>>> def get_body_contents(html):
...     soup = BeautifulSoup(html, "html5lib")
...     for name in ['head', 'html', 'body']:
...         tag = soup.find(name)  # None if the tag is absent
...         if tag is not None:
...             tag.unwrap()  # drop the tag itself, keep its children
...     return soup
...
>>> html = "<p>Hello World</p>"
>>> print(get_body_contents(html))
<p>Hello World</p>
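
One caveat worth sketching: `unwrap()` keeps a tag's children, so a non-empty head would leave its `<title>` (and similar elements) in the output. The variant below is only an illustration under assumptions: `body_only` and the `doc` string are hypothetical names, and it uses the built-in `html.parser` on a document that already contains html, head, and body tags. It removes the head with `decompose()` and unwraps the rest.

```python
from bs4 import BeautifulSoup

# Hypothetical full document with a non-empty <head>.
doc = "<html><head><title>Demo</title></head><body><p>Hello World</p></body></html>"

# Unwrapping <head> alone leaves its children (here <title>) behind.
unwrapped = BeautifulSoup(doc, "html.parser")
unwrapped.head.unwrap()
print(unwrapped)


def body_only(markup):
    """Return the markup inside <body> as a string (illustrative helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    if soup.head is not None:
        soup.head.decompose()  # remove <head> together with its children
    for name in ("html", "body"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()  # remove the wrapper tag, keep what is inside
    return str(soup)


print(body_only(doc))
```

With `html.parser`, a bare fragment such as `<p>Hi</p>` is returned unchanged, because that parser does not add the wrapper tags in the first place.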
Jeremy answered Oct 10 '22 05:10