How to scrape elements that immediately follow a certain element?

Tags: python, beautifulsoup

I have an HTML document that looks like this:

<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
  <a href="interesting link"></a>
  ...
</div>

I want to scrape only the links that follow the code tag. If I call soup.findAll('a'), it returns all the hyperlinks.

How can I make BS4 start scraping after that specific code element?


asked Dec 27 '15 08:12

masroore

1 Answer

Try find_all_next(), called on the code tag:

>>> tag = soup.find('div', {'id': "whatever"})
>>> tag.find('code').find_all_next('a')
[<a href="interesting link"></a>, <a href="interesting link"></a>]

It works like find_all(), but instead of searching a tag's descendants it returns every matching tag that appears after the given tag in the document.
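As a minimal sketch of that behaviour, the script below rebuilds the sample markup and adds one extra link after the closing div (the "outside link" is an assumption added for illustration, not part of the question). It suggests that find_all_next() keeps going past the parent div, while find_next_siblings() may be a better fit if only links inside the same parent are wanted.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus a hypothetical link outside the <div>.
html = """
<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  <code>blah blah</code>
  <a href="interesting link"></a>
  <a href="interesting link"></a>
</div>
<a href="outside link"></a>
"""

soup = BeautifulSoup(html, "html.parser")
code = soup.find("div", {"id": "whatever"}).find("code")

# find_all_next() walks the rest of the document, so it also picks up
# the link that sits outside the <div>.
after_code = [a["href"] for a in code.find_all_next("a")]

# find_next_siblings() only looks at later elements with the same parent.
sibling_links = [a["href"] for a in code.find_next_siblings("a")]

print(after_code)
print(sibling_links)
```

Which of the two to use depends on whether links beyond the enclosing element count as "after" for your page.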

If you would rather remove the <a> tags that come before <code>, there is a matching method, find_all_previous():

>>> tag.find('code').find_all_previous('a')
[<a href="unwanted link"></a>, <a href="unwanted link"></a>]

>>> for i in tag.find('code').find_all_previous('a'):
...     i.extract()
...
<a href="unwanted link"></a>
<a href="unwanted link"></a>

>>> tag
<div id="whatever">


  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
  <a href="interesting link"></a>
  ...
</div>

In short:

1. Find all <a> tags that come before the <code> tag.
2. Call extract() on each one in a for loop to remove it from the tree.
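The two steps above can be sketched as a standalone script. It assumes the sample markup from the question (with the "..." placeholders left out) and the built-in html.parser; the variable names are illustrative only. Note that find_all_previous() looks at everything earlier in the document, not just inside the div, so on a real page you may want to narrow the search.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, without the "..." placeholders.
html = """<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  <code>blah blah</code>
  <a href="interesting link"></a>
  <a href="interesting link"></a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "whatever"})

# Step 1: collect every <a> that appears before <code> in the document.
unwanted = div.find("code").find_all_previous("a")

# Step 2: detach each one from the tree; extract() returns the removed tag.
removed = [a.extract() for a in unwanted]

# Only the links after <code> are left in the <div>.
remaining = [a["href"] for a in div.find_all("a")]
print(len(removed), remaining)
```

If the removed tags are not needed afterwards, decompose() is an alternative to extract() that discards them instead of returning them.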


answered Nov 14 '22 23:11

Remi Crystal
