BeautifulSoup remove tags followed by specific tag and specific attribute

Question

I'm new to web scraping and so far I've been amazed by BeautifulSoup. However, there's one thing I haven't been able to do.

I want to remove certain tags that contain a specific child tag with a specific attribute.

Let me show you:

#Import modules
from bs4 import BeautifulSoup
import requests

#Parse URL
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

#This is the table which I want to extract
table = soup.find_all('table')[4]

In that table, some 'tr' tags contain a 'td' tag with a 'colspan' attribute.

I want to remove only those 'tr' tags and keep the other 'tr' tags, which I need.

There are 3 'td' tags with a 'colspan' attribute:

#Output for 'td' with 'colspan'

print(table.select('td[colspan]'))

[<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>]

Here is an extract from the HTML with one example of the 'tr' I want to remove (I inserted a note saying "#THIS ONE!"):

 <td align="center">
    2:1
   </td>
   <td class="one">
    AC Milan
   </td>
   <td>
    <a href="/Cagliari-AC_Milan-2320071-2320071.html">
     <img alt="More details about  -  soccer game" border="0" height="14" src="/imgs/detail3.gif" width="14"/>
    </a>
   </td>
  </tr>
  <tr class="predict">  <!-- #THIS ONE! -->
   <td colspan="13">
    <img height="10" src="/imgs/line.png" width="100%"/>
   </td>
   <tr class="predict">
    <td>
     27 May
    </td>
    <td>
     38
    </td>
    <td>
     FT
    </td>
    <td align="right" class="one">

By the way, I would like to remove 'td colspan' and 'img' as well.

Any ideas?

*Python latest version installed

*BeautifulSoup module latest version installed

MD. Khairul Basar · Accepted Answer

Find the specific tags you want to delete and then call decompose() or extract() on each of them.

for tag in tags_to_delete:
    tag.decompose()

Or

for tag in tags_to_delete:
    tag.extract()
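The two calls differ in what they hand back. Here is a minimal sketch of that difference, using a small made-up HTML snippet (the `id` values and variable names are only for illustration, not from the page in the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, just to compare the two methods.
html = "<div><p id='a'>first</p><p id='b'>second</p><p id='c'>third</p></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it,
# so the removed element can still be inspected or reused.
removed = soup.find("p", id="a").extract()

# decompose() detaches the tag and destroys it; it returns None.
result = soup.find("p", id="b").decompose()

# Only the untouched paragraph is left in the tree.
remaining = [p["id"] for p in soup.find_all("p")]

print(removed)
print(result)
print(remaining)
```

If you only want the tags gone, decompose() is usually enough; extract() is useful when you still need the removed element afterwards.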

EDIT

To find the specific tags, first find all the tr tags, then check whether each one contains a td with the attribute colspan="13". If it does, call decompose() on that tr. Decomposing the tr also removes the td and img inside it.

import requests
from bs4 import BeautifulSoup

url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')

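table = soup.find_all('table')[4]
for t in table.find_all("tr", class_="predict"):
    # Look for a separator cell (td with colspan="13") inside this row
    check = t.find("td", colspan="13")
    if check is not None:
        # Remove the whole row, including its td and img children
        t.decompose()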
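An alternative is to start from the 'td[colspan]' selector already used in the question and walk up to the enclosing row. The following is only a sketch under assumptions: the inline HTML is a simplified, made-up stand-in for the real table (with every tr properly closed), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the table on the real page.
html = """
<table>
  <tr class="predict"><td>20 May</td><td>AC Milan</td></tr>
  <tr class="predict"><td colspan="13"><img src="/imgs/line.png"/></td></tr>
  <tr class="predict"><td>27 May</td><td>Cagliari</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Select every td that has a colspan attribute, then remove its parent row.
for cell in table.select("td[colspan]"):
    row = cell.find_parent("tr")
    if row is not None:
        row.decompose()  # also removes the td and img inside the row

# Collect the first cell of each surviving row to confirm what is left.
dates = [row.find("td").get_text(strip=True) for row in table.find_all("tr")]
print(dates)
```

Note that the extract in the question shows the separator row without a closing </tr>. How such markup is repaired depends on the parser, so on the real page it is worth checking that the rows you want to keep are not nested inside the row you remove.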

Tags:

python

html

beautifulsoup

Edmundo Wright

1 Answers

MD. Khairul Basar
