
How to scrape elements that immediately follow a certain element?

I have an HTML document that looks like this:

<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
  <a href="interesting link"></a>
  ...
</div>

I want to scrape only the links that come after the code tag. If I do soup.findAll('a'), it returns all the hyperlinks.

How can I make BS4 start scraping after that specific code element?

masroore asked Dec 27 '15 08:12

1 Answer

Try find_all_next(), called on the <code> tag:

>>> tag = soup.find('div', {'id': "whatever"})
>>> tag.find('code').find_all_next('a')
[<a href="interesting link"></a>, <a href="interesting link"></a>]
>>> 

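It works like find_all(), but it only returns matching tags that come after the tag you call it on. Note that it searches the rest of the document, not just the enclosing <div>.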
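As a self-contained sketch of the difference in scope, the snippet below runs find_all_next() on a slightly extended version of the question's markup. The extra link after the closing </div> and the variable names are assumptions added for illustration; find_next_siblings() is shown as one possible way to stay inside the same parent.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the last link (outside the div)
# is an assumption added to show how far find_all_next() reaches.
html = """
<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  <code>blah blah</code>
  <a href="interesting link"></a>
  <a href="interesting link"></a>
</div>
<a href="link outside the div"></a>
"""

soup = BeautifulSoup(html, "html.parser")
code = soup.find("div", {"id": "whatever"}).find("code")

# find_all_next() walks the rest of the document, so it also picks up
# the link that sits after the closing </div>.
after_code = [a["href"] for a in code.find_all_next("a")]

# find_next_siblings() only looks at later tags with the same parent.
sibling_links = [a["href"] for a in code.find_next_siblings("a")]

print(after_code)
print(sibling_links)
```

If the links you want are always siblings of <code>, the sibling-based call may be the safer choice; otherwise find_all_next() is the broader search.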


If you'd like to remove the <a> tags before <code> instead, there is a matching method called find_all_previous():

>>> tag.find('code').find_all_previous('a')
[<a href="unwanted link"></a>, <a href="unwanted link"></a>]

>>> for i in tag.find('code').find_all_previous('a'):
...     i.extract()
... 
<a href="unwanted link"></a>
<a href="unwanted link"></a>

>>> tag
<div id="whatever">


  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
<a href="interesting link"></a>
  ...
</div>
>>> 

So the steps are:

  1. Find all <a> tags that come before the <code> tag.
  2. Loop over them and call extract() on each one to remove it.
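The two steps above can be sketched as a standalone script. The sample markup is adapted from the question, and the variable names (div, code, removed, remaining) are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question.
html = """
<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  <code>blah blah</code>
  <a href="interesting link"></a>
  <a href="interesting link"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "whatever"})
code = div.find("code")

# Step 1: find every <a> tag that appears before <code> in the document.
# Step 2: extract() detaches each tag from the tree and returns it.
removed = [a.extract() for a in code.find_all_previous("a")]

# Only the links after <code> are left inside the div.
remaining = [a["href"] for a in div.find_all("a")]

print(len(removed))
print(remaining)
```

If the removed tags are not needed afterwards, decompose() could be used in place of extract(); it destroys the tag instead of returning it.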
Remi Crystal answered Nov 14 '22 23:11
