What does BeautifulSoup's .contents do? I am working through crummy.com's tutorial and I don't really understand what .contents does. I have looked at the forums and have not found any answers. Looking at the code below...
from BeautifulSoup import BeautifulSoup
import re
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[0].contents[0].contents[0].name
I would expect the last line of the code to print out 'body' instead of...
File "pe_ratio.py", line 29, in <module>
print soup.contents[0].contents[0].contents[0].contents[0].name
File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__
raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'
Is .contents only concerned with html, head and title? If so, why is that?
Thanks for the help in advance.
It just gives you a list of what's directly inside the tag. Let me demonstrate with an example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
head = soup.head
print head.contents
The above code gives me a list, [<title>The Dormouse's story</title>], because that is what's inside the head tag. So calling [0] on it would give you the first item in the list.
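As a rough sketch of the same idea, the snippet below uses a shortened version of the document above to show that .contents is an ordinary list holding only a tag's direct children. It assumes Beautiful Soup 4 is installed and is written for Python 3; the "html.parser" argument and the variable names are my own choices, not part of the tutorial.

```python
from bs4 import BeautifulSoup

# Shortened version of the Dormouse document (an assumption made for brevity).
markup = ("<html><head><title>The Dormouse's story</title></head>"
          "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>")

# "html.parser" ships with Python, so no extra parser is needed.
soup = BeautifulSoup(markup, "html.parser")

head_children = soup.head.contents   # list with one item: the <title> tag
first_child = head_children[0]       # indexing the list gives that <title> tag
body_children = soup.body.contents   # only the direct child <p>, not the nested <b>

print(head_children)
print(first_child.name)
print([child.name for child in body_children])
```

The nested b tag does not show up in body_children; it is reached through the contents of the p tag instead.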
The reason you get an error is that soup.contents[0].contents[0].contents[0].contents[0] is not a tag at all. It is the text Page title from your code, a NavigableString: the first contents[0] gives you the html tag, the second gives you the head tag, the third gives you the title tag, and the fourth gives you the text inside the title. A NavigableString has no name attribute, so asking for .name on it raises the AttributeError you see.
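The walk described above can be sketched step by step. This is only an illustration, assuming Beautiful Soup 4 on Python 3 with the built-in "html.parser" (the question itself uses the older BeautifulSoup 3 on Python 2); the path list and the isinstance checks are my own additions, used to avoid asking a string for its name.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc), "html.parser")

# Follow contents[0] four times, recording what each step lands on.
node = soup
path = []
for _ in range(4):
    node = node.contents[0]
    if isinstance(node, Tag):
        path.append(("tag", node.name))    # tags carry a name
    else:
        path.append(("text", str(node)))   # strings are plain text, not tags

for kind, value in path:
    print(kind, value)
```

Checking the type before reading .name is one way to avoid the error when you do not know in advance whether a child is a tag or a piece of text.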
If you want the body printed, you can do the following:
soup = BeautifulSoup(''.join(doc))
print soup.body
If you want body using contents only, then use the following:
soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[1].name
You will not get it using [0] as the index, because body is the second child of html, after head.
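If you would rather not hard-code the index, one possible approach is to look through the children of html and pick the tag by name. This is a sketch under the same assumptions as before (Beautiful Soup 4, Python 3, the built-in "html.parser"); child_names and body are names I made up for the example. Keep in mind that in markup with line breaks or spaces between tags, that whitespace also appears in .contents as strings, so a fixed index like [1] may point somewhere else.

```python
from bs4 import BeautifulSoup, Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc), "html.parser")

html = soup.contents[0]

# Names of the direct child tags of <html>, skipping any text nodes.
child_names = [child.name for child in html.contents if isinstance(child, Tag)]

# Pick the child tag called "body" instead of relying on its position.
body = next(child for child in html.contents
            if isinstance(child, Tag) and child.name == "body")

print(child_names)
print(body.name)
```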