
How can I remove all different script tags in BeautifulSoup?

I scrape a table from a web page and would like to rebuild it with all the tags removed, keeping only the text. Here is my code.

import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

for row in table.find_all('tr'):
    for col in row.find_all('td'):
        #remove all different script tags
        #col.replace_with('') 
        #col.decompose()  
        #col.extract()
        col = col.contents

How can I remove all the different tags? Take the following cell as an example, which includes the tags a, br and td.

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

My expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Asked by SparkAndShine on Jul 18 '15 at 17:07


2 Answers

You are asking about get_text():

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

td = soup.find("td")
td.get_text()
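
get_text() also accepts optional separator and strip arguments, which can help when a cell mixes several tags. Here is a minimal sketch, assuming the cell markup from the question and Python's built-in html.parser; the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# The cell from the question, used here as sample input.
html = """<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>"""

td = BeautifulSoup(html, "html.parser").td

# Join the text fragments with a newline and strip the whitespace around each one.
cell_text = td.get_text(separator="\n", strip=True)
print(cell_text)
```

With strip=True, whitespace-only fragments (such as the line break between the two links) are dropped, so the result has no trailing newline.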

Note that .string would return None in this case, since td has multiple children:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> soup = BeautifulSoup(u"""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>> 
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
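
Applied to the loop in the question, the same call could be used to rebuild the table as plain text. The following is only a sketch: the sample table (including the second cell) is hypothetical and stands in for the page fetched with requests, and rows is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table standing in for the page fetched in the question.
html = """
<table>
  <tr><td><a href="http://www.irit.fr/SC">Signal et Communication</a>
  <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
  </td><td><b>IRIT</b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for row in table.find_all("tr"):
    # One plain-text string per cell; fragments inside a cell are joined by newlines.
    rows.append([col.get_text(separator="\n", strip=True) for col in row.find_all("td")])

print(rows)
```

Each entry of rows is a list of cell strings, so no tag needs to be removed or replaced in the tree; the text is simply read out of it.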

Answered by alecxe on Oct 02 '22 at 23:10


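Try calling col.string. That gives you the text only when the cell has a single child; for a cell with several children, like the one in the question, it returns None.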

Answered by blasko on Oct 02 '22 at 23:10