I'm scraping HTML data that is similar to the following:
<div class="target-content">
<p id="random1">
"the content of the p"
</p>
<p id="random2">
"the content of the p"
</p>
<p>
<q class="semi-predictable">
"q tag content that I don't want"
</q>
</p>
<p id="random3">
"the content of the p"
</p>
</div>
My goal is to get all the <p> tags along with their content, while excluding the <q> tag and its content. Currently, I am getting all the <p> tags with the following approach:
contentlist = soup.find('div', class_='target-content').find_all('p')
My question is: after I get the result set of all the <p> tags, how can I filter out the single <p>, along with its content, that contains the <q>?
Of note: after getting the result set from soup.find('div', class_='target-content').find_all('p'), I iteratively append each <p> from the result set to a string in the following manner:
content = ''
for p in contentlist:
    content += str(p)
You can use extract() to remove an unwanted tag before you get the text. The remaining text keeps all of its '\n' characters and spaces, so you will need some extra work to strip them. Alternatively, you can skip every Tag object inside the outer span and keep only the NavigableString objects (the plain text in the HTML). A single extract() call works only if there is just one unwanted tag.
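As a minimal sketch of both ideas, assuming a shortened version of the HTML from the question and a made-up span snippet (both are illustrative, not the real page): calling extract() in a loop over find_all() handles more than one unwanted tag, and get_text(strip=True) takes care of the leftover whitespace.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical version of the markup from the question.
html = """
<div class="target-content">
<p id="random1">first</p>
<p><q class="semi-predictable">unwanted</q></p>
<p id="random3">third</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="target-content")

# extract() detaches each <q> from the tree (and returns it).
for q in div.find_all("q"):
    q.extract()

# The emptied <p> is still in the tree; strip=True trims stray whitespace.
texts = [p.get_text(strip=True) for p in div.find_all("p")]

# Alternative: keep only the plain-text children of a tag, skipping nested tags.
outer = BeautifulSoup("<span>keep <b>drop</b> this</span>", "html.parser").span
plain = "".join(c for c in outer.children if isinstance(c, NavigableString))
```

Note that extract() only removes the <q> itself, so the now-empty <p> that wrapped it remains and shows up as an empty string in texts.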
decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
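A rough sketch of how decompose() could be applied to the question, assuming a shortened version of the sample HTML: the whole <p> that wraps a <q> is destroyed, so the remaining paragraphs can be joined as before.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html = """
<div class="target-content">
<p id="random1">first</p>
<p><q class="semi-predictable">unwanted</q></p>
<p id="random3">third</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="target-content")

# find_all() returns a list-like result set, so removing tags while looping is safe.
for p in div.find_all("p"):
    if p.find("q") is not None:
        p.decompose()  # removes the <p> and everything inside it

content = "".join(str(p) for p in div.find_all("p"))
```

This modifies the parsed tree in place, which may not be what you want if the same soup object is reused elsewhere.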
The find() method returns the first tag that matches the specified name or id, as a bs4.element.Tag object, or None if nothing matches.
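To illustrate, here is a small sketch using a made-up two-paragraph snippet (the ids "a" and "b" are placeholders): find() stops at the first match and returns None when there is none.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with two paragraphs.
soup = BeautifulSoup('<p id="a">one</p><p id="b">two</p>', "html.parser")

first = soup.find("p")        # first <p> in document order
by_id = soup.find(id="b")     # first tag whose id attribute is "b"
missing = soup.find("table")  # no such tag in the snippet
```

Because a miss returns None, it is worth checking the result before chaining another call such as .find_all('p') onto it.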
You can just skip the p tags that have a q tag inside:
for p in soup.select('div.target-content > p'):
    if p.q:  # if q is present - skip
        continue
    print(p)
where p.q is a shortcut for p.find("q"), and div.target-content > p is a CSS selector that matches all p tags that are direct children of the div element with the target-content class.
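Putting this together with the string-building loop from the question, one possible self-contained sketch looks like the following. The variable names paragraphs and ids are illustrative, and the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup

html = """
<div class="target-content">
<p id="random1">"the content of the p"</p>
<p id="random2">"the content of the p"</p>
<p><q class="semi-predictable">"q tag content that I don't want"</q></p>
<p id="random3">"the content of the p"</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the direct-child <p> tags that do not contain a <q> at any depth.
paragraphs = [p for p in soup.select("div.target-content > p") if p.q is None]

# Same result as the `content += str(p)` loop, built in one step.
content = "".join(str(p) for p in paragraphs)
ids = [p.get("id") for p in paragraphs]
```

Unlike extract() or decompose(), this leaves the parsed tree untouched and only filters the result.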