
BeautifulSoup nested tags

I'm trying to parse an XML file with BeautifulSoup, but I hit a brick wall when trying to use the "recursive" argument of findAll().

I have a pretty odd XML format, shown below:

<?xml version="1.0"?>
<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
      <book>true</book>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
      <book>false</book>
   </book>
 </catalog>

As you can see, the book tag repeats inside the book tag, which causes an error when I try to do something like:

from BeautifulSoup import BeautifulStoneSoup as BSS

catalog = "catalog.xml"


def open_rss():
    f = open(catalog, 'r')
    return f.read()

def rss_parser():
    rss_contents = open_rss()
    soup = BSS(rss_contents)
    items = soup.findAll('book', recursive=False)

    for item in items:
        print item.title.string

rss_parser()

As you can see, I've added recursive=False to my soup.findAll call, which in theory should make it not recurse into the item found, but skip to the next one.

This doesn't seem to work, as I always get the following error:

  File "catalog.py", line 17, in rss_parser
    print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'

I'm sure I'm doing something stupid here, and would appreciate it if someone could give me some help on how to solve this problem.

Changing the XML structure is not an option, and this code needs to perform well, as it will potentially parse a large XML file.

Marcos Placona asked Jan 04 '11 20:01


2 Answers

It appears the problem lies in the nested book tags. BeautifulSoup has a predefined set of tags that can be nested (BeautifulSoup.NESTABLE_TAGS), but it doesn't know that book can be nested, so it goes bonkers.
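To make the failure mode concrete, here is a toy model of a tag-stack parser. It is only a sketch and not BeautifulSoup's actual code: the function name open_tag_depths, its nestable parameter, and the shortened DOC string are all made up for illustration, and it assumes that a repeated tag which is not declared nestable implicitly closes the earlier open tag of the same name. Under that assumption the inner book ends up as a sibling of the outer one, without a title child, which would explain item.title being None.

```python
import re

# Shortened, hypothetical version of the catalog from the question.
DOC = "<catalog><book><title>T</title><book>true</book></book></catalog>"


def open_tag_depths(xml, nestable=()):
    """Toy tag-stack parser (not BeautifulSoup's real algorithm).

    Returns a list of (depth, tag_name) pairs, one per opening tag,
    where depth is the number of tags still open when it was seen.
    """
    stack, opened = [], []
    for closing, name in re.findall(r"<(/?)(\w+)>", xml):
        if closing:
            # Close the nearest matching open tag; ignore stray end tags.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            # Assumption: a repeated tag that is not declared nestable
            # implicitly closes the earlier open tag of the same name.
            if name in stack and name not in nestable:
                while stack.pop() != name:
                    pass
            opened.append((len(stack), name))
            stack.append(name)
    return opened


# Without declaring book nestable, the inner book lands at depth 1,
# next to the outer book instead of inside it.
print(open_tag_depths(DOC))

# Declaring book nestable keeps the inner book at depth 2.
print(open_tag_depths(DOC, nestable={"book"}))
```

The nestable argument plays the role that NESTABLE_TAGS plays in the real library, in a much simplified form.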

The "Customizing the parser" section of the BeautifulSoup documentation explains what's going on and how you can subclass BeautifulStoneSoup to customise the nestable tags. Here's how we can use it to fix your problem:

from BeautifulSoup import BeautifulStoneSoup

class BookSoup(BeautifulStoneSoup):
  NESTABLE_TAGS = {
      'book': ['book']
  }

soup = BookSoup(xml) # xml string omitted to keep this short
for book in soup.find('catalog').findAll('book', recursive=False):
  print book.title.string

If we run this, we get the following output:

XML Developer's Guide
Midnight Rain

moinudin answered Oct 25 '22 02:10


soup.findAll('catalog', recursive=False) will return a list containing only your top-level "catalog" tag. Since that doesn't have a "title" child, item.title is None.

Try soup.findAll("book") or soup.find("catalog").findChildren() instead.

Edit: OK, the problem wasn't what I thought it was. Try this:

BSS.NESTABLE_TAGS["book"] = []
soup = BSS(open("catalog.xml"))
soup.catalog.findChildren(recursive=False)
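
For comparison, the same "direct children only" lookup can be sketched with xml.etree.ElementTree from the Python standard library. This swaps in a different parser than the BeautifulStoneSoup used above, so it is an alternative rather than a fix for the BeautifulSoup code. It is written for Python 3, and the XML string is a shortened copy of the catalog from the question. The helper names top_level_titles and inner_flags are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog from the question.
XML = """<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <book>true</book>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <book>false</book>
   </book>
</catalog>"""


def top_level_titles(xml_text):
    """Return the title of every book that is a direct child of the root."""
    root = ET.fromstring(xml_text)
    # A bare tag name in findall() matches direct children only,
    # so the nested <book>true</book> elements are not returned here.
    return [book.findtext("title") for book in root.findall("book")]


def inner_flags(xml_text):
    """Return the text of the nested book element inside each top-level book."""
    root = ET.fromstring(xml_text)
    return [book.findtext("book") for book in root.findall("book")]


for title in top_level_titles(XML):
    print(title)
```

Because ElementTree is a strict XML parser, it needs no hint that book can be nested inside book. For very large files, the same module's iterparse() may be worth a look, though that is not shown here.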

Thomas K answered Oct 25 '22 03:10