
BeautifulSoup get only the "general" text in a td tag, and nothing in nested tags

Say that my html looks like this:

<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>

I have beautifulsoup doing this:

for tag in soup.find_all("td"):
    print(tag.text)

And I get

Potato1 Potato2
....
Potato9 Potato10

Is it possible to get only the text that sits directly inside the td tag, without any text nested inside the span tag?

Stupid.Fat.Cat asked Jul 07 '15 17:07



2 Answers

You can use .contents:

>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])
...
Potato1
Potato9

What does it do?

A tag's children are available as a list through its .contents attribute.

>>> for tag in soup.find_all("td"):
...     print(tag.contents)
...
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]

Since we are only interested in the first element, we use:

print(tag.contents[0])
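
As the output above shows, the first child keeps its trailing space ('Potato1 '). A minimal sketch of the same idea with the whitespace stripped follows. The class="x" attribute and the surrounding table markup are placeholders I made up, since the question elides the real attributes, and "html.parser" is chosen only so the sketch does not depend on an extra parser being installed. It still assumes the wanted text is always the first child of each td.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the question's HTML;
# the span attribute is a placeholder, not the asker's real one.
html = """
<table><tr>
<td>Potato1 <span class="x">Potato2</span></td>
<td>Potato9 <span class="x">Potato10</span></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# contents[0] is the text node before the span; strip() drops the trailing space.
first_texts = [tag.contents[0].strip() for tag in soup.find_all("td")]
print(first_texts)
```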
nu11p01n73R answered Nov 13 '22 19:11



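Another method filters each td's direct children by type. Unlike tag.contents[0], it guarantees that every result is a NavigableString that is a direct child of the td, never a Tag and never text from inside a child Tag (bs here is the bs4 module, imported as in the example further down):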

[child for tag in soup.find_all("td") 
 for child in tag if isinstance(child, bs.NavigableString)]

Here is an example which highlights the difference:

import bs4 as bs

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
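# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs.BeautifulSoup(content, "html.parser")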

print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]

print([child for tag in soup.find_all("td") 
       for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
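
If you want one string per td rather than a flat list, the same isinstance filter can be wrapped in a small helper. This is only a sketch: direct_text is a name I made up, the fourth td is an extra case added to show text on both sides of a span, and "html.parser" is chosen so no extra parser is needed. Note that bs4's Comment class is a subclass of NavigableString, so HTML comments directly inside a td would also pass this filter.

```python
import bs4 as bs

def direct_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags.

    Hypothetical helper, not part of BeautifulSoup.
    """
    pieces = [child.strip() for child in tag.children
              if isinstance(child, bs.NavigableString)]
    # Drop whitespace-only pieces and join the rest with single spaces.
    return " ".join(piece for piece in pieces if piece)

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
<td>Left <span>mid</span> right</td>
'''
soup = bs.BeautifulSoup(content, "html.parser")

result = [direct_text(td) for td in soup.find_all("td")]
print(result)
# ['Potato1', '', 'Potato9', 'Left right']
```

A td with no direct text of its own, like the second one, yields an empty string instead of being dropped, so the results stay aligned with the cells.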

Or, with lxml, you could use the XPath td/text():

import lxml.html as LH

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)

print(root.xpath('td/text()'))

yields

['Potato1 ', 'Potato9']
unutbu answered Nov 13 '22 21:11
