
Difference between .string and .text BeautifulSoup

I noticed something odd when working with BeautifulSoup and couldn't find any documentation to explain it, so I wanted to ask here.

Say we have tags like these that we have parsed with BS:

<td>Some Table Data</td> <td></td> 

The officially documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted.

However, I couldn't find any reference to this in the documentation and am worried that something is amiss. Can anyone let me know whether this is acceptable to use, or will it cause problems later?

BTW I am scraping table data from a web page and mean to create CSVs from the data so I do actually need empty strings rather than NoneTypes.

mez.pahlan asked Aug 15 '14 13:08




2 Answers

.string on a Tag object returns a NavigableString. .text, on the other hand, gets all the child strings and returns them concatenated using the given separator. The return type of .text is a unicode object.

From the documentation: "A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree."

The documentation on .string explains the difference. If the HTML looks like this:
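The type difference can be checked directly. This is only a small sketch, assuming Python 3 (where the "unicode" type is simply str) and the built-in "html.parser" backend; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<td>Some Table Data</td>", "html.parser")
cell = soup.td

string_value = cell.string  # a NavigableString
text_value = cell.text      # a plain string

# NavigableString is a str subclass, so it compares equal to ordinary strings.
is_navigable = isinstance(string_value, NavigableString)
is_also_str = isinstance(string_value, str)

# .text does not return a NavigableString.
text_is_navigable = isinstance(text_value, NavigableString)

# The tree-navigation features are what set NavigableString apart.
parent_name = string_value.parent.name
```

Here string_value still knows which tag it came from (parent_name is "td"), while text_value is detached from the tree.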

<td>Some Table Data</td> <td></td> 

then .string on the second td returns None, but .text returns an empty string, which is a unicode object.
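Since the question is about building CSVs, here is a rough sketch of how .text could feed csv.writer so that every cell is a string. The wrapping table/tr markup, the "html.parser" backend, and the in-memory buffer are assumptions for the example, not part of the original question.

```python
import csv
import io

from bs4 import BeautifulSoup

# The two cells from the question, wrapped in a table row for illustration.
html = "<table><tr><td>Some Table Data</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

buffer = io.StringIO()  # stand-in for a real file opened with newline=""
writer = csv.writer(buffer)

for tr in soup.find_all("tr"):
    # .text gives "" for the empty cell instead of None.
    writer.writerow([td.text for td in tr.find_all("td")])

csv_output = buffer.getvalue()
```

If you want to stick with .string, something like `td.string or ""` gives the same result for these two cells, but it also turns cells with mixed content into empty strings.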

To summarize:

.string

  • Convenience property of a tag to get the single string within this tag.
  • If the tag has a single string child, the return value is that string.
  • If the tag has no children or more than one child, the return value is None.
  • If the tag has one child tag, the return value is the .string attribute of the child tag, recursively.

.text

  • Gets all the child strings and returns them concatenated using the given separator (an empty string by default).
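The "given separator" refers to get_text(), which .text calls with its defaults. A short sketch of how the separator and strip arguments change the result, assuming the "html.parser" backend and a sample cell chosen for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>even <p>more text</p></td>", "html.parser")
cell = soup.td

# .text is equivalent to get_text() with an empty separator.
default_text = cell.text

# A visible separator shows where the individual child strings were joined.
piped = cell.get_text(separator="|")

# strip=True strips each child string before joining.
stripped = cell.get_text(separator=" ", strip=True)
```

With these inputs the pieces are "even " and "more text", so piped shows the join point and stripped drops the trailing space before rejoining.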

If the html is like this:

<td>some text</td> <td></td> <td><p>more text</p></td> <td>even <p>more text</p></td> 

.string on the four td elements returns:

some text
None
more text
None

.text gives the following result (the second value is an empty string):

some text

more text
even more text
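The four cases can be reproduced with a few lines. This is a sketch that assumes the "html.parser" backend and wraps the cells in a table row so they parse as one document:

```python
from bs4 import BeautifulSoup

html = (
    "<table><tr>"
    "<td>some text</td>"              # single string child
    "<td></td>"                       # no children
    "<td><p>more text</p></td>"       # single child tag
    "<td>even <p>more text</p></td>"  # string plus a child tag
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

strings = [cell.string for cell in cells]
texts = [cell.text for cell in cells]

for string_value, text_value in zip(strings, texts):
    print(repr(string_value), "|", repr(text_value))
```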
salmanwahed answered Sep 21 '22 20:09


If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

For example:

<td>sometext<p>sometext</p></td> 

Here td.string returns None, because the td contains text as well as a p tag. td.text, however, returns: sometextsometext
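A quick way to see the "more than one thing" is to look at the tag's children and its .strings generator. This is a sketch assuming the "html.parser" backend; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>sometext<p>sometext</p></td>", "html.parser")
td = soup.td

child_count = len(td.contents)  # one string plus one <p> tag
single = td.string              # ambiguous, so None
pieces = list(td.strings)       # every string below the tag, in document order
joined = td.text                # the same pieces joined with ""
```

If the run-together sometextsometext is a problem, td.get_text(" ") joins the pieces with a space instead.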

Raju Thapa EverestBlogger answered Sep 23 '22 20:09