I'm new to web scraping, and so far I've been amazed by BeautifulSoup. However, there's something I haven't been able to do.
I want to remove certain tags that contain a specific child tag with a specific attribute.
Let me show you:
#Import modules
from bs4 import BeautifulSoup
import requests
#Parse URL
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
#This is the table which I want to extract
table = soup.find_all('table')[4]
After obtaining the table I want to manipulate, I see that some 'tr' tags contain a 'td' with a 'colspan' attribute.
I want to remove only those specific 'tr' tags, because the table has other 'tr' tags that I need.
There are 3 'td' tags with a 'colspan' attribute in total:
#Output for 'td' with 'colspan'
print(table.select('td[colspan]'))
[<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>,
<td colspan="13"><img height="10" src="/imgs/line.png" width="100%"/></td>]
Here is an extract from the HTML with one example of the specific 'tr' I want to remove (I inserted a note saying "#THIS ONE!"):
<td align="center">
2:1
</td>
<td class="one">
AC Milan
</td>
<td>
<a href="/Cagliari-AC_Milan-2320071-2320071.html">
<img alt="More details about - soccer game" border="0" height="14" src="/imgs/detail3.gif" width="14"/>
</a>
</td>
</tr>
<tr class="predict">    <-- #THIS ONE!
<td colspan="13">
<img height="10" src="/imgs/line.png" width="100%"/>
</td>
<tr class="predict">
<td>
27 May
</td>
<td>
38
</td>
<td>
FT
</td>
<td align="right" class="one">
By the way, I would like to remove 'td colspan' and 'img' as well.
Any ideas?
*Python latest version installed
*BeautifulSoup module latest version installed
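Find the specific tags you want to delete and then use decompose() or extract(). decompose() removes the tag from the tree and destroys it along with its contents; extract() removes the tag from the tree and returns it, so you can still use it afterwards.
for tag in tags_to_delete:
    tag.decompose()
Or
for tag in tags_to_delete:
    tag.extract()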
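To make the difference between the two calls concrete, here is a minimal sketch. The HTML string and the ids in it are made up for illustration; they are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate the two methods.
html = "<div><p id='a'>first</p><p id='b'>second</p><p id='c'>third</p></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it, so it can be reused.
removed = soup.find("p", id="a").extract()

# decompose() detaches the tag and destroys it; there is nothing to reuse.
soup.find("p", id="b").decompose()

remaining_ids = [p["id"] for p in soup.find_all("p")]
print(removed.get_text())  # text of the extracted tag is still available
print(remaining_ids)       # only the untouched paragraph is left in the tree
```

If you never need the removed tags again, decompose() is the more natural choice; use extract() when you want to keep or move them.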
EDIT
To find the specific tags, first find all the tr tags, then check whether each one contains a td with the attribute colspan="13". If it does, decompose() it. Removing the tr also removes the td and img inside it.
import requests
from bs4 import BeautifulSoup
url = "http://www.soccervista.com/Italy-Serie_A-2016_2017-845699.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
table = soup.find_all('table')[4]
# Remove every "predict" row that contains a <td colspan="13"> separator cell
for t in table.find_all("tr", class_="predict"):
    check = t.find("td", colspan="13")
    if check is not None:
        t.decompose()
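An alternative sketch, starting from the select('td[colspan]') call already used in the question: select the separator cells, then walk up to the enclosing row and remove it. The HTML below is a small, well-formed stand-in written for this example, not the real page, so the team names and row count are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the table in the question.
html = """
<table>
  <tr class="predict"><td>20 May</td><td class="one">AC Milan</td></tr>
  <tr class="predict"><td colspan="13"><img src="/imgs/line.png"/></td></tr>
  <tr class="predict"><td>27 May</td><td class="one">Cagliari</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Find each cell that has a colspan attribute, then remove its parent row.
for cell in table.select("td[colspan]"):
    row = cell.find_parent("tr")
    if row is not None:
        row.decompose()

rows = table.find_all("tr")
print(len(rows))                        # data rows that remain
print(len(table.select("td[colspan]")))  # separator cells that remain
print(len(table.find_all("img")))        # images that remain
```

This version does not depend on the colspan value being exactly "13", which may or may not be what you want on the real page.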