
Problem accessing attributes in BeautifulSoup

I am having problems using Python (2.7). The code basically consists of:

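from BeautifulSoup import BeautifulStoneSoup

str = '<el at="some">ABC</el><el>DEF</el>'
z = BeautifulStoneSoup(str)

for x in z.findAll('el'):
    if 'at' in x:             # first attempt
    # if hasattr(x, 'at'):    # second attempt (swap in for the line above)
        print x['at']
    else:
        print 'nothing'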

I expected the first if statement to work correctly (i.e. if at doesn't exist, print "nothing"), but it always prints "nothing" (i.e. the condition is always False). The second if, on the other hand, is always True, which causes the code to raise a KeyError when it tries to access at on the second <el> element, where it doesn't exist.

NullUserException asked May 01 '11 12:05





1 Answer

The in operator is for sequence and mapping types; what makes you think the object returned by BeautifulSoup implements it the way you expect? According to the BeautifulSoup docs, you should access attributes using the [] syntax.

Re hasattr, I think you confused HTML/XML attributes with Python object attributes. hasattr is for the latter, and BeautifulSoup, AFAIK, doesn't expose the HTML/XML attributes it parsed as attributes of its own objects.
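
To illustrate why hasattr can come back True for every element, here is a minimal sketch. It assumes the tag object answers unknown attribute names with a child-tag lookup that returns None instead of raising AttributeError, which is how I understand BeautifulSoup 3's Tag to behave. FakeTag is a hypothetical stand-in written for this example, not the library's class, and the code runs on both Python 2 and 3.

```python
# Hypothetical stand-in for a parsed tag; NOT BeautifulSoup's real class.
class FakeTag(object):
    def __init__(self, attrs, contents):
        self.attrs = dict(attrs)        # markup attributes, e.g. at="some"
        self.contents = list(contents)  # child nodes / text

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails. Dunder names
        # still raise, but any other name is treated as a "find child tag"
        # request that returns None when nothing matches (assumed behaviour).
        if name.startswith('__'):
            raise AttributeError(name)
        return None


with_at = FakeTag({'at': 'some'}, ['ABC'])
without_at = FakeTag({}, ['DEF'])

# hasattr() only asks "did getattr() raise?", so both report True.
print(hasattr(with_at, 'at'))      # True
print(hasattr(without_at, 'at'))   # True

# The markup attribute lives in the attrs mapping instead.
print('at' in with_at.attrs)       # True
print('at' in without_at.attrs)    # False
```

In other words, hasattr says nothing about the markup here; it only reports that the Python-level lookup did not fail.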

P.S. note that the Tag object in BeautifulSoup does implement __contains__ - so maybe you're trying with the wrong object? Can you show a complete but minimal example that demonstrates the problem?


Running this:

from BeautifulSoup import BeautifulSoup

str = '<el at="some">ABC</el><el>DEF</el>'
z = BeautifulSoup(str)

for x in z.findAll('el'):
    print type(x)
    print x['at']

I get:

<class 'BeautifulSoup.Tag'>
some
<class 'BeautifulSoup.Tag'>
Traceback (most recent call last):
  File "soup4.py", line 8, in <module>
    print x['at']
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 601, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'at'

Which is what I expected. The first el has an at attribute, the second doesn't - and this throws a KeyError.


Update 2: BeautifulSoup.Tag.__contains__ looks inside the contents of the tag, not its attributes, which is why 'at' in x is always False here. To check whether an attribute exists, use x.has_key('at') or x.get('at') instead of in.
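
The __contains__ behaviour described above can be sketched with a small stand-in class. FakeTag and its methods are hypothetical and only mimic the shape of the problem (membership tests search the children, [] reads an attribute and raises KeyError, get returns a default); it is not BeautifulSoup's implementation. The code runs on both Python 2 and 3.

```python
# Hypothetical stand-in for a parsed tag; NOT BeautifulSoup's real class.
class FakeTag(object):
    def __init__(self, attrs, contents):
        self._attrs = dict(attrs)
        self.contents = list(contents)

    def __contains__(self, item):
        # `item in tag` searches the children, not the attributes.
        return item in self.contents

    def __getitem__(self, key):
        # `tag[key]` reads an attribute; raises KeyError when absent.
        return self._attrs[key]

    def get(self, key, default=None):
        # Attribute lookup that falls back to a default instead of raising.
        return self._attrs.get(key, default)


tags = [FakeTag({'at': 'some'}, ['ABC']), FakeTag({}, ['DEF'])]

results = []
for x in tags:
    value = x.get('at')
    results.append(value if value is not None else 'nothing')

print('at' in tags[0])    # False: 'at' is not one of the children
print('ABC' in tags[0])   # True: 'ABC' is a child
print(results)            # ['some', 'nothing']
```

With the real library the same loop shape should work by calling get('at') on each tag returned by findAll, so the missing attribute yields None rather than a KeyError.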

Eli Bendersky answered Nov 10 '22 21:11
