I have an HTML document that looks like this:
<div id="whatever">
<a href="unwanted link"></a>
<a href="unwanted link"></a>
...
<code>blah blah</code>
...
<a href="interesting link"></a>
<a href="interesting link"></a>
...
</div>
I want to scrape only the links that follow the code tag. If I do soup.findAll('a'), it returns all hyperlinks.
How can I make BS4 start scraping after that specific code element?
The class selector matches all elements with a specific class attribute. An element can have multiple classes, and only one of them has to match. To select elements with a specific class, write a period (.) followed by the name of the class.
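As a rough illustration of the class selector in BS4, the sketch below uses soup.select(), which accepts CSS selectors. The markup and the class names "ref" and "external" are made up for the example and are not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names "ref" and "external" are invented
# purely for illustration.
html = """
<a class="ref external" href="first"></a>
<a class="ref" href="second"></a>
<a class="other" href="third"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# ".ref" matches every element whose class list contains "ref",
# even when the element carries other classes as well.
ref_links = [a["href"] for a in soup.select(".ref")]

# "a.external" narrows the match to <a> elements with the class "external".
external_links = [a["href"] for a in soup.select("a.external")]

print(ref_links)
print(external_links)
```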
Try soup.find_all_next():
>>> tag = soup.find('div', {'id': "whatever"})
>>> tag.find('code').find_all_next('a')
[<a href="interesting link"></a>, <a href="interesting link"></a>]
>>>
It is like soup.find_all(), but it only finds tags that come after a given tag in the document.
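One thing worth knowing: find_all_next() keeps going past the end of the enclosing div, all the way to the end of the document. If only the links inside the same div are wanted, find_next_siblings() may be a better fit. Here is a minimal, self-contained sketch of the difference; the footer link and the numbered href values are additions for the demo and are not in the original markup.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question. The trailing footer link is a
# made-up addition to show how the two methods differ.
html = """
<div id="whatever">
<a href="unwanted link"></a>
<code>blah blah</code>
<a href="interesting link 1"></a>
<a href="interesting link 2"></a>
</div>
<p><a href="footer link"></a></p>
"""

soup = BeautifulSoup(html, "html.parser")
code = soup.find("div", id="whatever").find("code")

# find_all_next() walks everything after <code> in document order,
# including elements outside the enclosing <div>.
after_code = [a["href"] for a in code.find_all_next("a")]

# find_next_siblings() only looks at later children of the same parent.
siblings = [a["href"] for a in code.find_next_siblings("a")]

print(after_code)
print(siblings)
```

With this input, after_code also contains the footer link, while siblings holds only the two links inside the div.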
If you'd like to remove the <a> tags before <code>, there is a method called soup.find_all_previous():
>>> tag.find('code').find_all_previous('a')
[<a href="unwanted link"></a>, <a href="unwanted link"></a>]
>>> for i in tag.find('code').find_all_previous('a'):
...     i.extract()
...
<a href="unwanted link"></a>
<a href="unwanted link"></a>
>>> tag
<div id="whatever">
...
<code>blah blah</code>
...
<a href="interesting link"></a>
<a href="interesting link"></a>
...
</div>
>>>
In short, find_all_previous('a') finds the <a> tags that come before the <code> tag, and calling extract() on each of them in a for loop removes them from the tree.
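Note that find_all_previous() is not limited to the div either: it reaches back to the start of the document, so links above the div would be removed too. The sketch below assumes only the links inside the same div should go and uses find_previous_siblings() instead. The header link and the numbered href values are made up for the demo.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question. The leading header link is a
# made-up addition that should survive the cleanup.
html = """
<a href="header link"></a>
<div id="whatever">
<a href="unwanted link 1"></a>
<a href="unwanted link 2"></a>
<code>blah blah</code>
<a href="interesting link"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
code = soup.find("div", id="whatever").find("code")

# find_previous_siblings() returns a list of the earlier children of the
# same parent, so it is safe to modify the tree while looping over it.
removed = []
for link in code.find_previous_siblings("a"):
    link.extract()  # detach the tag from the tree and return it
    removed.append(link["href"])

# Links that are still in the document after the cleanup.
remaining = [a["href"] for a in soup.find_all("a")]

print(removed)
print(remaining)
```

If the removed tags are not needed afterwards, decompose() can be used in place of extract(); it destroys the tag instead of returning it.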