
BeautifulSoup parent tag

I have some html that I want to extract text from. Here's an example of the html:

<p>TEXT I WANT <i> &#8211; </i></p>

Now, there are, obviously, lots of <p> tags in this document. So, find('p') is not a good way to get at the text I want to extract. However, that <i> tag is the only one in the document. So, I thought I could just find the <i> and then go to the parent.

I've tried:

up = soup.select('p i').parent

and

up = soup.select('i')
print(up.parent)

and I've tried it with .parents, I've tried find_all('i'), find('i')... But I always get:

'list' object has no attribute "parent"

What am I doing wrong?

porteclefs asked Feb 25 '14 19:02




2 Answers

select() and find_all() return a list of matches, and a list has no .parent attribute. find('i') returns the first matching element, or None if nothing matches.

Thus, use:

try:
    up = soup.find('i').parent
except AttributeError:
    # no <i> element: find() returned None, which has no .parent
    up = None

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>TEXT I WANT <i> &#8211; </i></p>', 'html.parser')
>>> soup.find('i').parent
<p>TEXT I WANT <i> – </i></p>
>>> soup.find('i').parent.text
'TEXT I WANT  – '
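
If you would rather keep the CSS selector from the question, here is a minimal sketch of two ways to get from the select() result to a single tag. The extra first paragraph in the sample HTML and the variable names are made up for illustration. select_one() is not mentioned in the question; it is the bs4 method that returns the first match or None, much like find().

```python
from bs4 import BeautifulSoup

# Sample markup: the second <p> is the one from the question,
# the first is an invented "other" paragraph.
html = "<p>Other paragraph</p><p>TEXT I WANT <i> &#8211; </i></p>"
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list of matches, even when there is only one.
matches = soup.select("p i")

# Option 1: index into the list to get a single Tag, then ask for its parent.
parent_via_index = matches[0].parent

# Option 2: select_one() returns the first match directly (or None).
parent_via_select_one = soup.select_one("p i").parent

print(parent_via_index.name)
print(parent_via_index is parent_via_select_one)
```

Both options fail if there is no match (an IndexError for the indexing form, an AttributeError for select_one()), so the same guard as above applies.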
Martijn Pieters answered Nov 07 '22 13:11


This works:

i_tag = soup.find('i')
my_text = str(i_tag.previous_sibling).strip()  # previous_sibling is the bs4 name for previousSibling

output:

'TEXT I WANT'

As mentioned in the other answer, find_all() returns a list, whereas find() returns the first match or None.

If you are unsure whether an <i> tag is present, you can wrap the lookup in a try/except block.
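
As an alternative to try/except, here is a rough sketch that checks for None explicitly. The helper name text_before_i is hypothetical, and it assumes the text you want is the node directly before the first <i> tag, as in the question's HTML.

```python
from bs4 import BeautifulSoup


def text_before_i(html):
    """Return the stripped text directly before the first <i> tag, or None.

    text_before_i is an illustrative helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    i_tag = soup.find("i")
    if i_tag is None:
        # No <i> element anywhere in the document.
        return None
    sibling = i_tag.previous_sibling
    if sibling is None:
        # The <i> tag is the first child of its parent.
        return None
    return str(sibling).strip()


print(text_before_i("<p>TEXT I WANT <i> &#8211; </i></p>"))
print(text_before_i("<p>no italics here</p>"))
```

Note that if the node before <i> is another tag rather than plain text, str() returns that tag's markup, so this only fits documents shaped like the example.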

Totem answered Nov 07 '22 12:11