I'm trying to parse an XML file with BeautifulSoup, but I've hit a brick wall when trying to use the "recursive" argument of findAll().
I have a pretty odd XML format, shown below:
<?xml version="1.0"?>
<catalog>
<book>
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
<book>true</book>
</book>
<book>
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
<book>false</book>
</book>
</catalog>
As you can see, the book tag repeats inside the book tag, which causes an error when I try to do something like:
from BeautifulSoup import BeautifulStoneSoup as BSS

catalog = "catalog.xml"

def open_rss():
    f = open(catalog, 'r')
    return f.read()

def rss_parser():
    rss_contents = open_rss()
    soup = BSS(rss_contents)
    items = soup.findAll('book', recursive=False)
    for item in items:
        print item.title.string

rss_parser()
As you can see, I've added recursive=False to my soup.findAll call, which in theory should make it not recurse into the item it found, but skip to the next one.
This doesn't seem to work, as I always get the following error:
File "catalog.py", line 17, in rss_parser
print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'
I'm sure I'm doing something stupid here, and would appreciate if someone could give me some help on how to solve this problem.
Changing the XML structure is not an option, and this code needs to perform well, as it will potentially parse a large XML file.
It appears the problem lies in the nested book tags. BeautifulSoup has a predefined set of tags that can be nested (BeautifulSoup.NESTABLE_TAGS), but it doesn't know that book can be nested, so it goes bonkers.
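To make the failure mode concrete, here is a toy model of a stack-based tree builder. It is not BeautifulSoup's real code, just a rough sketch under the assumption that a repeated tag which isn't marked nestable implicitly closes the earlier open tag of the same name. The names build_tree, child_names and EVENTS are made up for this illustration.

```python
def build_tree(events, nestable=frozenset()):
    """Toy stack-based tree builder (illustration only, not BeautifulSoup)."""
    root = {"name": "[document]", "children": []}
    stack = [root]
    for kind, name in events:
        open_names = [node["name"] for node in stack]
        if kind == "start":
            # Assumption: a repeated, non-nestable tag closes the earlier one.
            if name in open_names and name not in nestable:
                last = len(open_names) - 1 - open_names[::-1].index(name)
                del stack[last:]
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif name in open_names:
            # An end tag closes the most recent open tag with that name;
            # stray end tags are ignored.
            last = len(open_names) - 1 - open_names[::-1].index(name)
            del stack[last:]
    return root


def child_names(node):
    return [child["name"] for child in node["children"]]


# One <book> holding a <title> and an inner <book>, as in the question.
EVENTS = [
    ("start", "catalog"),
    ("start", "book"),
    ("start", "title"), ("end", "title"),
    ("start", "book"), ("end", "book"),
    ("end", "book"),
    ("end", "catalog"),
]

flat = build_tree(EVENTS)["children"][0]
print([child_names(b) for b in flat["children"]])    # [['title'], []]

nested = build_tree(EVENTS, nestable={"book"})["children"][0]
print([child_names(b) for b in nested["children"]])  # [['title', 'book']]
```

In the first run the inner book becomes a sibling with no title child, which is the kind of tree where item.title comes back as None. In the second run, with book marked nestable, the outer book keeps both its title and the inner book.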
The "Customizing the Parser" section of the documentation explains what's going on and how you can subclass BeautifulStoneSoup to customise the nestable tags. Here's how we can use it to fix your problem:
from BeautifulSoup import BeautifulStoneSoup

class BookSoup(BeautifulStoneSoup):
    NESTABLE_TAGS = {
        'book': ['book']
    }

soup = BookSoup(xml) # xml string omitted to keep this short
for book in soup.find('catalog').findAll('book', recursive=False):
    print book.title.string
If we run this, we get the following output:
XML Developer's Guide
Midnight Rain
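If switching parsers is an option (an assumption, the question doesn't say either way), the same "top-level books only" walk can be sketched with the standard library's xml.etree.ElementTree on Python 3. This is a different technique from the NESTABLE_TAGS subclass above, not a BeautifulSoup feature: findall() with a plain tag name only matches direct children, so the inner book elements are skipped without extra configuration. The XML string is a shortened copy of the catalog from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog from the question.
XML = """<?xml version="1.0"?>
<catalog>
  <book>
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <book>true</book>
  </book>
  <book>
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <book>false</book>
  </book>
</catalog>"""

root = ET.fromstring(XML)  # root is the <catalog> element

# findall("book") only looks at direct children of <catalog>,
# so the nested <book>true</book> flags are not returned here.
top_level_books = root.findall("book")
titles = [book.findtext("title") for book in top_level_books]
flags = [book.findtext("book") for book in top_level_books]

for title in titles:
    print(title)
```

The nested flag is still reachable from each top-level book, here via book.findtext("book").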
soup.findAll('catalog', recursive=False) will return a list containing only your top-level "catalog" tag. Since that doesn't have a "title" child, item.title is None.
Try soup.findAll("book") or soup.find("catalog").findChildren() instead.
Edit: OK, the problem wasn't what I thought it was. Try this:
BSS.NESTABLE_TAGS["book"] = []
soup = BSS(open("catalog.xml"))
soup.catalog.findChildren(recursive=False)