
Using “renderContents” in BeautifulSoup with Python

Environment: Python 2.7 + BeautifulSoup 4.3.2

Here is a part of the original HTML code:

<dl><dt>Newest Item:</dt><dd><span class="NewsTime" title="Southeast in 2007">SE, 2007</span></dd></dl>

I want to extract the text “SE, 2007”.

Here is what I came up with:

from bs4 import BeautifulSoup
import re
import urllib2

url = "http://sample.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

NEWS = soup.find_all("span", class_="NewsTime", limit=1)  # limit=1 because two identical spans match

for LA in NEWS:
    print LA.renderContents()

This works. But it fails when I replace the last two lines with:

print NEWS.renderContents()

Why? Also, is my understanding of the original HTML structure correct?

<dl> is the parent
<dt> and <dd> are children of <dl>
<span> is a child of <dd>
Mark K asked May 05 '26 04:05


1 Answer

NEWS is a ResultSet as far as BeautifulSoup is concerned. It doesn't matter that there's only one result in the set - it's still a ResultSet, and you can't call renderContents() on a ResultSet.

The find_all() method always returns a bs4.element.ResultSet, a list-like container holding zero or more elements of type bs4.element.Tag. renderContents() is defined on Tag, not on ResultSet, so you can only call it on an individual element.
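To see the difference for yourself, here is a minimal sketch. It parses the snippet from the question as an inline string instead of fetching a URL, and it passes "html.parser" explicitly; both of those are my assumptions, not part of the original code. The lowercase names are only for illustration.

```python
from bs4 import BeautifulSoup, Tag

# The HTML fragment from the question, inlined so the example is self-contained.
html = ('<dl><dt>Newest Item:</dt><dd><span class="NewsTime" '
        'title="Southeast in 2007">SE, 2007</span></dd></dl>')
soup = BeautifulSoup(html, "html.parser")

news = soup.find_all("span", class_="NewsTime", limit=1)

# find_all() hands back a list-like ResultSet, even when limit=1.
container_type = type(news).__name__
first_is_tag = isinstance(news[0], Tag)

# Calling a Tag method on the container fails with AttributeError.
try:
    news.renderContents()
    raised = False
except AttributeError:
    raised = True

# Calling it on an element of the container works and gives a byte string.
rendered = news[0].renderContents()

print(container_type)
print(raised)
```

The same applies to any other Tag method or attribute: take an element out of the ResultSet first, then call the method on that element.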

In this case, to avoid the for loop, you can take the first element with a zero index (this raises an IndexError if nothing matches):

NEWS = soup.find_all("span",class_="NewsTime", limit=1)[0]

print(NEWS.renderContents())
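An alternative worth considering is find(), which returns the first matching Tag directly, or None when nothing matches, so no index is needed. The sketch below also checks the parent/child structure asked about in the question using the .parent attribute. As before, the inline HTML string, the "html.parser" argument and the variable names are my assumptions for the sake of a self-contained example.

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question, inlined so the example is self-contained.
html = ('<dl><dt>Newest Item:</dt><dd><span class="NewsTime" '
        'title="Southeast in 2007">SE, 2007</span></dd></dl>')
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), not a ResultSet.
span = soup.find("span", class_="NewsTime")

label = None
span_parent = None
dd_parent = None
dt_parent = None

if span is not None:
    label = span.get_text()                   # text inside the span
    span_parent = span.parent.name            # tag that directly contains <span>
    dd_parent = span.parent.parent.name       # tag that directly contains <dd>
    dt_parent = soup.find("dt").parent.name   # tag that directly contains <dt>
    print(label)
```

For this fragment the checks agree with the reading in the question: <span> sits inside <dd>, and both <dt> and <dd> sit inside <dl>.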
Stephan answered May 06 '26 20:05