I have been trying to extract some data from HTML files. I have the logic coded to find the right cells, but now I am struggling to get the actual contents of a cell.
Here is my HTML snippet:
headerRows[0][10].contents
[<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">
</font></font></font>]
Note that the result is a Python list (hence the []).
I need the value Apples Produced but can't get to it.
Any suggestions would be appreciated.
Suggestions on a good book that explains this would earn my eternal gratitude.
Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold tag?
Say it is:
[<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">
</font></font></font>]
Apples Produced
I am trying to learn to read and understand the documentation, and your response will help.
I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than it has been from the BeautifulSoup documentation. I learned to program in the Fortran era, and now I am learning Python and I am amazed at its power. BeautifulSoup is an example. Making a coherent whole of the documentation is tough for me.
Cheers
The BeautifulSoup documentation should cover everything you need. In this case it looks like you want to use findNext:
headerRows[0][10].findNext('b').string
A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:
>>> soup = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join([t.string for t in soup.findAll(text=True)])
u'Test 1 More Test 2'
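To show why collecting text nodes works whatever tag wraps the value, here is a minimal sketch of the same idea. It deliberately uses only the standard library's html.parser rather than BeautifulSoup, so it illustrates the principle and is not the library's own method. The names TextCollector and cell_text are hypothetical helpers made up for this example, and the two cell strings are copied from the question.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node, ignoring the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def cell_text(html):
    """Return all text in an HTML fragment, with surrounding whitespace trimmed."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # The nested <font> tags leave a stray newline behind, so strip it.
    return "".join(parser.chunks).strip()


# The two cells from the question: one wraps the value in <b>, the other in <I>.
bold_cell = ('<font size="+0"><font face="serif" size="1"><b>Apples Produced</b>'
             '<font size="3">\n</font></font></font>')
italic_cell = ('<font size="+0"><font face="serif" size="1"><I>Apples Produced</I>'
               '<font size="3">\n</font></font></font>')

print(cell_text(bold_cell))
print(cell_text(italic_cell))
```

Because only the text nodes are collected, neither call needs to know whether the value sits inside a bold or an italic tag, which is the same reason the text=True search generalizes.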