I have some html that I want to extract text from. Here's an example of the html:
<p>TEXT I WANT <i> – </i></p>
Now, there are obviously lots of <p> tags in this document, so find('p') is not a good way to get at the text I want to extract. However, that <i> tag is the only one in the document, so I thought I could just find the <i> and then go to the parent.
I've tried:
up = soup.select('p i').parent
and
up = soup.select('i')
print(up.parent)
I've also tried it with .parents, with find_all('i'), and with find('i'), but I always get:
'list' object has no attribute 'parent'
What am I doing wrong?
The .parent attribute is provided by the Beautiful Soup (bs4) Python library and gives access to the parent of any element. For example, the <head> tag is the parent of the <title> tag. The .parent of the top-level BeautifulSoup object is None. You can iterate over all of an element's ancestors with .parents, which travels from a tag buried deep within the document to the very top of the document.
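As a rough illustration of .parent and .parents, here is a minimal sketch. The sample HTML string and the variable names (html, i_tag, ancestors) are made up for this example, and it assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, wrapped in <html>/<head>/<body> for illustration
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>TEXT I WANT <i> – </i></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

i_tag = soup.find("i")

# .parent is the element that directly contains the tag
print(i_tag.parent.name)

# .parents walks upward through every ancestor, ending at the soup object itself
ancestors = [parent.name for parent in i_tag.parents]
print(ancestors)

# The top-level BeautifulSoup object has no parent
print(soup.parent)
```

The last entry in ancestors is the BeautifulSoup object itself, whose name is '[document]'.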
select() and find_all() return a list of matches, and a list has no .parent attribute. find('i') returns the first matching element, or None.
Thus, use:
try:
    up = soup.find('i').parent
except AttributeError:
    # no <i> element: find() returned None
    up = None
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>TEXT I WANT <i> – </i></p>', 'html.parser')
>>> soup.find('i').parent
<p>TEXT I WANT <i> – </i></p>
>>> soup.find('i').parent.text
u'TEXT I WANT \u2013 '
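If you would rather keep select() or find_all(), the list they return has to be indexed or looped over before asking for .parent. The following is a small sketch of that idea, not the only way to do it. The extra <p>other</p> element and the variable names are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document with more than one <p>, but only one <i>
soup = BeautifulSoup("<p>other</p><p>TEXT I WANT <i> – </i></p>", "html.parser")

# select() returns a list of tags, so index into it before using .parent
matches = soup.select("p i")
first_parent = matches[0].parent if matches else None

# select_one() returns a single tag (or None), like find()
one = soup.select_one("p i")
same_parent = one.parent if one is not None else None

# find_all() also returns a list; take .parent of each element
parents = [tag.parent for tag in soup.find_all("i")]

print(first_parent)
```

All three approaches end up at the same <p> element here, because the document contains a single <i> tag.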
This also works, by taking the text node just before the <i> tag:
i_tag = soup.find('i')
my_text = str(i_tag.previous_sibling).strip()
Output:
'TEXT I WANT'
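One caveat with the sibling approach: if the <i> tag happens to be the first child of its parent, the previous sibling is None and str() turns that into the text 'None'. A hedged sketch of two slightly more defensive variants follows. The names before_i and own_text are made up for this example.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>TEXT I WANT <i> – </i></p>", "html.parser")
i_tag = soup.find("i")

# Variant 1: only use the previous sibling if it really is a text node
sibling = i_tag.previous_sibling
before_i = sibling.strip() if isinstance(sibling, NavigableString) else ""

# Variant 2: collect only the parent's own direct text nodes,
# which skips the text inside the nested <i> tag
own_text = "".join(i_tag.parent.find_all(string=True, recursive=False)).strip()

print(before_i)
print(own_text)
```

The second variant avoids the trailing dash that .parent.text includes, since it ignores text nested inside child tags.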
As mentioned in other answers, find_all() returns a list, whereas find() returns the first match or None.
If you are unsure about the presence of an <i> tag, you could simply use a try/except block.
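As an alternative to try/except, the None case can be checked explicitly. This is just a sketch of that option. The helper name parent_of_first is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def parent_of_first(soup, tag_name):
    """Return the parent of the first <tag_name> element, or None if absent.

    Hypothetical helper for illustration, not a Beautiful Soup API.
    """
    tag = soup.find(tag_name)
    return tag.parent if tag is not None else None


soup = BeautifulSoup("<p>TEXT I WANT <i> – </i></p>", "html.parser")

found = parent_of_first(soup, "i")    # the enclosing <p> tag
missing = parent_of_first(soup, "b")  # no <b> in the document

print(found)
print(missing)
```

An explicit check only handles the missing-tag case, whereas a broad except AttributeError could also hide an unrelated typo in an attribute name.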