
Exclude Tags Based on Content in Beautifulsoup

I'm scraping html data that is similar to the following:

<div class="target-content">
    <p id="random1">
      "the content of the p"
    </p>

    <p id="random2">
      "the content of the p"
    </p>

    <p>
      <q class="semi-predictable">
         "q tag content that I don't want"
      </q>
    </p>

    <p id="random3">
      "the content of the p"
    </p>

</div>

My goal is to get all the <p> tags, along with their content, while excluding the <q> tag and its content. Currently, I am getting all the <p> tags with the following approach:

contentlist = soup.find('div', class_='target-content').find_all('p')

My question is: after I get the result set of all the <p> tags, how can I filter out the single <p> that contains the <q>, along with its content?

Of note: after getting the result set from soup.find('div', class_='target-content').find_all('p'), I iterate over it and append each <p> to a string in the following manner:

content = ''
for p in contentlist:
    content += str(p)
alphazwest asked Jun 27 '16 15:06


People also ask

How do you exclude a tag in BeautifulSoup?

You can use extract() to remove an unwanted tag before you get the text. The surrounding '\n' characters and spaces are kept, so you will need some extra work to strip them. Alternatively, you can skip every Tag object inside the outer span and keep only the NavigableString objects (the plain text in the HTML). A single extract() call removes only one tag, so it is simplest when there is just one unwanted tag.

What function in BeautifulSoup will remove a tag from the HTML tree and destroy it?

decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
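As a rough illustration of how decompose() could apply to the question above, the sketch below removes every <p> that contains a <q> directly from the tree. The HTML string is a shortened, made-up stand-in for the real page, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup in the question.
html = '<div class="target-content"><p id="a">keep</p><p><q>drop</q></p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="target-content")

# find_all() returns a list, so removing tags while looping over it is safe.
for p in div.find_all("p"):
    if p.find("q") is not None:
        p.decompose()  # removes the <p> and everything inside it

remaining = div.find_all("p")
print(remaining)
```

Unlike filtering a result list, this modifies the parsed tree itself, so any later search on soup will no longer see the removed <p>.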

What is Find () method in BeautifulSoup?

The find() method returns the first tag that matches the specified name or id, as a bs4 object.
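A small sketch of find() in use, assuming a made-up two-paragraph snippet and the html.parser parser. It also shows that find() returns None when nothing matches, which is what makes a "does this tag contain a q?" check possible.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original page.
html = '<div><p id="random1">first</p><p id="random2">second</p></div>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")         # first <p> in document order
by_id = soup.find(id="random2")  # lookup by the id attribute
missing = soup.find("q")         # there is no <q>, so this is None

print(first_p, by_id, missing)
```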


1 Answer

You can simply skip any p tag that has a q tag inside it:

for p in soup.select('div.target-content > p'):
    if p.q:  # if q is present - skip
        continue
    print(p)

where p.q is a shortcut for p.find("q"), which returns None when the p contains no q. div.target-content > p is a CSS selector that matches all p tags that are direct children of the div element with the target-content class.
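Putting this together with the string-building loop from the question, a self-contained sketch might look like the following. The HTML is a condensed copy of the sample in the question, html.parser is an assumed parser choice, and the names kept and content are only illustrative.

```python
from bs4 import BeautifulSoup

# Condensed copy of the sample markup from the question.
html = """
<div class="target-content">
    <p id="random1">"the content of the p"</p>
    <p id="random2">"the content of the p"</p>
    <p><q class="semi-predictable">"q tag content that I don't want"</q></p>
    <p id="random3">"the content of the p"</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the direct-child <p> tags that have no <q> anywhere inside them.
kept = [p for p in soup.select("div.target-content > p") if p.find("q") is None]

# Build the combined string with a single join instead of repeated +=.
content = "".join(str(p) for p in kept)
print(content)
```

The filter leaves the parsed tree untouched, so the skipped <p> is still available in soup if it is needed later.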

alecxe answered Oct 14 '22 15:10