
How do you get the text from an HTML 'datacell' using BeautifulSoup?

I have been trying to strip out some data from HTML files. I have the logic coded to get the right cells. Now I am struggling to get the actual contents of the 'cell':

Here is my HTML snippet:

headerRows[0][10].contents

  [<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">       
  </font></font></font>]

Note that this is a list item from Python [].

I need the value Apples Produced but can't get to it.

Any suggestions would be appreciated.

Suggestions on a good book that explains this would earn my eternal gratitude.


Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold tag?

Say it is:

 [<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">       
  </font></font></font>]

Apples Produced

I am trying to learn to read and understand the documentation, and your response will help.

I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than it has been from the BeautifulSoup documentation. I learned to program in the Fortran era and now I am learning Python, and I am amazed at its power; BeautifulSoup is an example. Making a coherent whole of the documentation is tough for me.

Cheers

PyNEwbie asked Oct 21 '08 20:10


1 Answer

The BeautifulSoup documentation should cover everything you need - in this case it looks like you want to use findNext:

headerRows[0][10].findNext('b').string
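For readers on the current bs4 package, where findNext is spelled find_next, the call above might look like the sketch below. The `<td>` wrapper, the `html` string, and the `cell` variable are assumptions standing in for headerRows[0][10], which is not shown in full in the question.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the cell in the question, wrapped in a <td>.
html = (
    '<td><font size="+0"><font face="serif" size="1"><b>Apples Produced</b>'
    '<font size="3">       \n  </font></font></font></td>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")  # plays the role of headerRows[0][10]

# find_next returns the first <b> tag that appears after `cell` in the document.
bold = cell.find_next("b")

# Guard against a missing <b>: find_next returns None when nothing matches.
label = bold.string if bold is not None else None
print(label)
```

Note that find_next keeps searching past the end of the cell, so in a full table it could pick up a `<b>` from a later cell if this one has none.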

A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:

>>> soup = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join([t.string for t in soup.findAll(text=True)])
u'Test 1 More Test 2'
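Applied to the cells from the question, the same idea can be sketched with the current bs4 package, where findAll(text=True) is written find_all(string=True). The helper name `cell_text` and the two sample strings are assumptions made for illustration; the markup is copied from the question.

```python
from bs4 import BeautifulSoup


def cell_text(markup):
    """Return the text of an HTML fragment, whatever tags wrap it."""
    soup = BeautifulSoup(markup, "html.parser")
    # string=True matches every text node (NavigableString) in the fragment.
    pieces = soup.find_all(string=True)
    # Join the pieces and trim the whitespace left by the empty <font> tag.
    return "".join(pieces).strip()


bold_cell = (
    '<font size="+0"><font face="serif" size="1"><b>Apples Produced</b>'
    '<font size="3">       \n  </font></font></font>'
)
# Same cell with <i> instead of <b>, as in the follow-up question.
italic_cell = bold_cell.replace("<b>", "<i>").replace("</b>", "</i>")

print(cell_text(bold_cell))
print(cell_text(italic_cell))
```

Both calls print Apples Produced, since the search never names the `<b>` or `<i>` tag. bs4 also provides get_text(), which collects the text nodes in a single call.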
Jonny Buchanan answered Oct 23 '22 22:10