Say that my HTML looks like this:
<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>
I have BeautifulSoup doing this:
for tag in soup.find_all("td"):
    print(tag.text)
And I get
Potato1 Potato2
....
Potato9 Potato10
Would it be possible to just get the text that's inside the tag but not any text nested inside the span tag?
You can use .contents, as follows:
>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])
...
Potato1
Potato9
What does it do?
A tag's children are available as a list via .contents.
>>> for tag in soup.find_all("td"):
...     print(tag.contents)
...
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]
Since we are only interested in the first element, we use
print(tag.contents[0])
Another method, which, unlike tag.contents[0], guarantees that the text is a NavigableString and not text from within a child Tag, is:
[child for tag in soup.find_all("td")
 for child in tag if isinstance(child, bs.NavigableString)]
Here is an example which highlights the difference:
import bs4 as bs
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
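# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs.BeautifulSoup(content, "html.parser")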
print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]
print([child for tag in soup.find_all("td")
       for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
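If you want one string per cell rather than a flat list, a minimal sketch along the same lines is to ask each td for its direct string children only, using find_all(string=True, recursive=False). The helper name direct_text, the table/tr wrapper around the cells, and the choice of "html.parser" are assumptions made for illustration; the string= keyword needs a reasonably recent bs4 (older releases spell it text=).

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
</tr></table>
"""


def direct_text(tag):
    """Join only the strings that are immediate children of `tag`.

    `direct_text` is a hypothetical helper name, not part of bs4.
    recursive=False restricts the search to direct children, so text
    nested inside <span> (or any other child tag) is skipped.
    """
    return "".join(tag.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(html, "html.parser")
cells = [direct_text(td) for td in soup.find_all("td")]
print(cells)
```

This should print ['Potato1', '', 'Potato9']: the second cell has no text of its own, so it comes back as an empty string instead of being dropped, which keeps the result aligned with the cells.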
Or, with lxml, you could use the XPath td/text():
import lxml.html as LH
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)
print(root.xpath('td/text()'))
yields
['Potato1 ', 'Potato9']
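The XPath above returns a flat list, so you cannot tell which cell each string came from. A rough sketch that keeps one entry per td, assuming the cells sit inside a table/tr wrapper (added here for illustration) and that stripping surrounding whitespace is acceptable:

```python
import lxml.html as LH

content = '''
<table><tr>
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
</tr></table>
'''

root = LH.fromstring(content)

# For every td, evaluate text() relative to that td: this selects only
# the text nodes that are direct children of the cell, not text inside
# nested elements such as <span>.
per_cell = [''.join(td.xpath('text()')).strip() for td in root.xpath('//td')]
print(per_cell)
```

Cells without any direct text come back as empty strings, so the list stays the same length as the number of td elements.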