BeautifulSoup remove tags followed by specific tag and specific attribute

I'm new to web scraping, and so far I've been amazed by BeautifulSoup. However, there's one thing I haven't been able to do.

I want to remove certain tags that contain a specific child tag with a specific attribute.

Let me show you:

# Import modules
from bs4 import BeautifulSoup
import requests

# Fetch and parse the page
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# This is the table I want to extract
table = soup.find_all('table')[4]

Once I have the table I want to manipulate, some of its 'tr' tags contain a 'td' with a 'colspan' attribute.

I want to remove only those specific 'tr' tags and keep the other 'tr' tags, which I need.

There are 3 'td' tags with a 'colspan' attribute:

#Output for 'td' with 'colspan'

print(table.select('td[colspan]'))

[<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
 <td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>]

Here is an extract from the HTML with one example of the specific 'tr' I want to remove (I inserted a note saying "#THIS ONE!"):

 <td align="center">
    2:1
   </td>
   <td class="one">
    AC Milan
   </td>
   <td>
    <a href="/Cagliari-AC_Milan-2320071-2320071.html">
     <img alt="More details about  -  soccer game" border="0" height="14" src="/imgs/detail3.gif" width="14"/>
    </a>
   </td>
  </tr>
  <tr class="predict">  <!-- #THIS ONE! -->
   <td colspan="13">
    <img height="10" src="/imgs/line.png" width="100%"/>
   </td>
   <tr class="predict">
    <td>
     27 May
    </td>
    <td>
     38
    </td>
    <td>
     FT
    </td>
    <td align="right" class="one">

By the way, I would like to remove 'td colspan' and 'img' as well.

Any ideas?

*Python latest version installed

*BeautifulSoup module latest version installed

asked Jan 20 '26 03:01 by Edmundo Wright


1 Answer

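Find the specific tags you want to delete and then call decompose() or extract() on each one. decompose() removes the tag from the tree and destroys it; extract() removes the tag and returns it.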

for tag in tags_to_delete:
    tag.decompose()

Or

for tag in tags_to_delete:
    tag.extract() 
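
As a minimal sketch of the difference between the two calls, here is a small made-up snippet rather than the real page (the markup, the ids, and the variable names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration
html = "<div><p id='a'>keep</p><p id='b'>drop</p><p id='c'>move</p></div>"
soup = BeautifulSoup(html, "html.parser")

# decompose() removes the tag from the tree and destroys it
soup.find("p", id="b").decompose()

# extract() removes the tag from the tree and returns it,
# so it can still be inspected or inserted somewhere else
removed = soup.find("p", id="c").extract()

remaining_ids = [p["id"] for p in soup.find_all("p")]
removed_text = removed.get_text()
```

If you only want the tags gone, decompose() is usually enough; extract() is useful when you still need the removed tag afterwards.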

EDIT

To find the specific tags, first find all the tr tags, then check whether each one contains a td with the attribute colspan="13". If it does, decompose() it.

import requests
from bs4 import BeautifulSoup

url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
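soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser requires the lxml package

table = soup.find_all('table')[4]
# Remove every "predict" row that contains a td with colspan="13"
for t in table.find_all("tr", class_="predict"):
    check = t.find("td", colspan="13")
    if check is not None:
        t.decompose()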
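
The same idea can be tried offline. This sketch is not the code above but an alternative that uses a CSS selector instead of the find() check. It runs on a small, well-formed, made-up table (the rows are assumptions, not the real page) and assumes a BeautifulSoup version whose select() supports :has() (4.7 or later). It matches any colspan value, not only 13.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the real table
html = """
<table>
  <tr class="predict"><td>27 May</td><td>AC Milan</td></tr>
  <tr class="predict"><td colspan="13"><img src="/imgs/line.png"/></td></tr>
  <tr class="predict"><td>28 May</td><td>Cagliari</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Select rows that have a direct td child carrying a colspan attribute
for row in table.select("tr:has(> td[colspan])"):
    row.decompose()  # also removes the td and img inside the row

remaining_rows = [tr.get_text(strip=True) for tr in table.find_all("tr")]
```

Because decompose() removes the whole row, the 'td colspan' and 'img' inside it go with it. On the real page the extract in the question shows a tr opened inside an unclosed tr, so the result may depend on which parser is used; printing table.prettify() first is a quick way to check how the rows are nested.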
answered Jan 22 '26 17:01 by MD. Khairul Basar