I crawled a table from a web page and would like to rebuild it with all of the HTML tags removed. Here is the source code.
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text)
table = soup.find('table')
for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all of the tags
        #col.replace_with('')
        #col.decompose()
        #col.extract()
        col = col.contents
How can I remove all of the tags? Take the following cell as an example, which includes the tags a, br and td.
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
My expected result is:
Signal et Communication
Ingénierie Réseaux et Télécommunications
You are asking about get_text():
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
td = soup.find("td")
td.get_text()
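As a minimal sketch of how this applies to the cell from the question: get_text() also accepts a separator string and a strip flag, which can tidy up the whitespace left behind by the markup. The html.parser choice and the variable names below are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# The example cell from the question
html = """<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>"""

# Parser named explicitly (an assumption; the question does not name one)
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is the string used to join the remaining fragments
cell_text = td.get_text(separator="\n", strip=True)
print(cell_text)
```

Without strip=True the result keeps the newlines that sit between the tags in the source, which is usually harmless when printing but can matter when the text is stored or compared.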
Note that .string would return None in this case, since td has multiple children:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup("""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>>
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
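Applied to the loop from the question, the same call can collect the text of every cell. This is only a sketch: the inline HTML stands in for response.text, the second cell is made up for illustration, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; the second cell is invented
html = """
<table>
  <tr>
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    </td>
    <td>Example</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for row in table.find_all("tr"):
    # One plain-text string per cell, with the markup discarded
    cells = [col.get_text(separator="\n", strip=True)
             for col in row.find_all("td")]
    rows.append(cells)

print(rows)
```

Building a new list of strings, rather than reassigning col inside the loop, keeps the parsed tree untouched and leaves the extracted text available for rebuilding the table afterwards.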
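Try calling col.string. That gives you the text only, but it works only when the cell contains a single child; for a cell with several children, like the one in the question, it returns None, so col.get_text() is the safer choice.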