Difference between .string and .text in BeautifulSoup

Tags: python, beautifulsoup

I noticed something odd when working with BeautifulSoup and couldn't find any documentation to support it, so I wanted to ask here.

Say we have tags like these that we have parsed with BS:

<td>Some Table Data</td> <td></td>

The officially documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted.

However, I couldn't find any reference to this in the documentation and am worried that something is amiss. Can anyone let me know if this is acceptable to use, or will it cause problems later?

BTW, I am scraping table data from a web page and mean to create CSVs from the data, so I do actually need empty strings rather than None.


asked Aug 15 '14 13:08

mez.pahlan

2 Answers

.string on a Tag object returns a NavigableString object, or None in the cases listed further down. .text, on the other hand, collects all the child strings and returns them concatenated with the given separator (an empty string by default). The return type of .text is a unicode object (str in Python 3).

From the documentation: "A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree."

Following the documentation on .string, if the HTML is like this:

<td>Some Table Data</td> <td></td>

Then .string on the second td will return None, but .text will return an empty string, which is a unicode object (str in Python 3).
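A minimal sketch of that difference, assuming bs4 is installed and Python 3 with the built-in html.parser. The table/tr wrapper and the variable names are only there for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

html = "<table><tr><td>Some Table Data</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("td")

# .string hands back the single child string, or None if there is none.
first_string = first.string
second_string = second.string

# .text always builds a plain string, even when there is nothing to join.
first_text = first.text
second_text = second.text

print(type(first_string).__name__, repr(first_string))
print(repr(second_string))
print(type(second_text).__name__, repr(second_text))
```

Since NavigableString subclasses the ordinary string type, first_string still compares equal to "Some Table Data"; the practical difference only shows up for the empty cell.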

To summarize:

.string

Convenience property of a tag to get the single string within this tag.
If the tag has a single string child, then the return value is that string.
If the tag has no children or more than one child, then the return value is None.
If the tag has one child tag, then the return value is the .string attribute of that child tag, recursively.
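Two consequences of these rules that are easy to miss, sketched under the assumption that bs4 is installed and html.parser is used (the markup is made up for illustration): the recursion follows any depth of single-child nesting, and whitespace between tags counts as a child too, so pretty-printed HTML often yields None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<td><p><b>deep</b></p></td>"
    "<td>\n<p>padded</p>\n</td>",
    "html.parser",
)
nested, padded = soup.find_all("td")

# One child at every level: td -> p -> b -> string.
nested_string = nested.string

# Three children here (newline, <p>, newline), so .string gives up.
padded_string = padded.string

# get_text(strip=True) drops the whitespace-only pieces instead.
padded_text = padded.get_text(strip=True)

print(repr(nested_string), repr(padded_string), repr(padded_text))
```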

.text

Gets all the child strings and returns them concatenated using the given separator.
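The separator is passed through get_text(), which .text calls with its defaults. A small sketch, assuming bs4 with html.parser; the sample markup is invented for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>even<p>more text</p></td>", "html.parser")
td = soup.td

# .text is equivalent to get_text() with an empty separator.
plain = td.text

# An explicit separator is placed between the child strings.
piped = td.get_text("|")

# strip=True trims each piece before joining and skips empty ones.
spaced = td.get_text(" ", strip=True)

print(repr(plain), repr(piped), repr(spaced))
```

For table cells that mix text and nested tags, a space separator with strip=True tends to give the more readable result.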

If the HTML is like this:

<td>some text</td> <td></td> <td><p>more text</p></td> <td>even <p>more text</p></td>

.string on the four td elements returns:

some text
None
more text
None

.text gives:

some text
(empty string)
more text
even more text
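The same comparison as a runnable sketch, assuming bs4 with html.parser (the table/tr wrapper is added only so the cells have a parent):

```python
from bs4 import BeautifulSoup

html = (
    "<table><tr>"
    "<td>some text</td>"
    "<td></td>"
    "<td><p>more text</p></td>"
    "<td>even <p>more text</p></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# One value per cell for each accessor.
strings = [td.string for td in cells]
texts = [td.text for td in cells]

# repr() makes None and the empty string easy to tell apart.
for s, t in zip(strings, texts):
    print(repr(s), repr(t))
```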

answered Sep 21 '22 20:09

salmanwahed

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

Example:

<td>sometext<p>sometext</p></td>

Here td.string returns None, because the td contains a text node as well as a p tag. td.text, however, returns: sometextsometext
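For the CSV use case in the question, this suggests building each row from the text of its cells. A rough sketch, assuming bs4 with html.parser and writing to an in-memory buffer; the sample table is made up, and a real script would open a file instead:

```python
import csv
import io

from bs4 import BeautifulSoup

html = (
    "<table>"
    "<tr><td>Some Table Data</td><td></td></tr>"
    "<tr><td>sometext<p>sometext</p></td><td>last</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# get_text() never returns None, so empty cells become empty fields.
rows = [
    [td.get_text() for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Substituting td.string or "" would also avoid None, but it would silently drop the content of the mixed cell, since .string is None there as well.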

answered Sep 23 '22 20:09

Raju Thapa EverestBlogger
