Exclude Tags Based on Content in Beautifulsoup

Tags: python, beautifulsoup, web-scraping

I'm scraping HTML data similar to the following:

<div class="target-content">
    <p id="random1">
      "the content of the p"
    </p>

    <p id="random2">
      "the content of the p"
    </p>

    <p>
      <q class="semi-predictable">
         "q tag content that I don't want"
      </q>
    </p>

    <p id="random3">
      "the content of the p"
    </p>

</div>

My goal is to get all the <p> tags, along with their content, while excluding the <q> tag and its content. Currently, I am getting all the <p> tags with the following approach:

contentlist = soup.find('div', class_='target-content').find_all('p')

My question is: after I find the result set of all the <p> tags, how can I filter out the single <p>, along with its content, that contains the <q>?

Of note: after getting the result set from soup.find('div', class_='target-content').find_all('p'), I iteratively append each <p> from the result set to a string in the following manner:

content = ''
# concatenate the markup of every <p> in the result set
for p in contentlist:
    content += str(p)

asked Jun 27 '16 15:06

alphazwest

1 Answer

You can simply skip any p tag that has a q tag inside it:

for p in soup.select('div.target-content > p'):
    if p.q:  # if q is present - skip
        continue
    print(p)

Here p.q is a shortcut for p.find("q"), which returns None when the p has no q descendant. div.target-content > p is a CSS selector that matches all p tags that are direct children of the div element with the target-content class.
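
To tie this back to the question's goal of building one string, here is a minimal end-to-end sketch. It assumes the page looks like the sample markup in the question and uses the built-in html.parser. The names html, kept and content are illustrative, not part of the original code.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question (an assumption
# about what the real page looks like).
html = """
<div class="target-content">
    <p id="random1">"the content of the p"</p>
    <p id="random2">"the content of the p"</p>
    <p><q class="semi-predictable">"q tag content that I don't want"</q></p>
    <p id="random3">"the content of the p"</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the direct-child <p> tags that have no <q> descendant.
kept = [p for p in soup.select("div.target-content > p") if p.find("q") is None]

# Join the markup of the remaining tags in one step instead of repeated +=.
content = "".join(str(p) for p in kept)

print([p.get("id") for p in kept])  # ['random1', 'random2', 'random3']
```

The same filter can be applied to an existing find_all('p') result set, since each item is an ordinary Tag. Recent Beautiful Soup releases that ship with the soupsieve selector engine should also accept a selector such as div.target-content > p:not(:has(q)), but check that against your installed version before relying on it.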

answered Oct 14 '22 15:10

alecxe
