My HTML looks like:
<td>
<table ..>
<tr>
<th ..>price</th>
<th>$99.99</th>
</tr>
</table>
</td>
So I am in the current table cell, how would I get the 99.99 value?
I have so far:
td[3].findChild('th')
But what I need to do is:
Find the th with text 'price', then get the next th tag's string value.
find returns the first matching element, or None when nothing matches. find_all returns a list of every match among the descendants of the object it is called on (the entire document when called on the BeautifulSoup object itself).
BeautifulSoup has a built-in method for extracting the text of an element: get_text(). Call it on any Tag or BeautifulSoup object to get all the text beneath it as a single string. A NavigableString needs no such extraction, because the object itself already is the string.
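As a rough illustration of the difference between find, find_all and get_text(), here is a minimal sketch using the bs4 package with the standard-library html.parser. The second table row (the "stock" row) is made up for the example and is not part of the question's markup.

```python
from bs4 import BeautifulSoup

# Sample markup: the first row mirrors the question, the second row is
# a hypothetical addition to show a cell with a nested tag.
html = """
<table>
  <tr><th>price</th><th>$99.99</th></tr>
  <tr><th>stock</th><th><b>12</b> left</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_th = soup.find("th")        # first match only
all_th = soup.find_all("th")      # list of every match below soup
missing = soup.find("td")         # no <td> here, so this is None

first_text = first_th.get_text()      # text of a cell holding only a string
nested_text = all_th[3].get_text()    # text gathered through the nested <b>
nested_string = all_th[3].string      # None: the cell has more than one child

print(first_text, len(all_th), missing, nested_text, nested_string)
```

The last two lines show why get_text() is often the safer choice when a cell may contain further tags: .string only returns a value when the tag has a single child.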
Think about it in "steps"... given that some x is the root of the subtree you're considering, x.findAll(text='price') is the list of all text items in that subtree that match 'price' exactly (to match substrings, pass a compiled regular expression instead). The parents of those items then of course will be:
[t.parent for t in x.findAll(text='price')]
and if you only want to keep those whose "name" (tag) is 'th', then of course
[t.parent for t in x.findAll(text='price') if t.parent.name=='th']
and you want the "next siblings" of those (but only if they're also 'th's), so
[t.parent.nextSibling for t in x.findAll(text='price')
 if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']
Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:
Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.
for t in x.findAll(text='price'):
    p = t.parent
    if p.name != 'th': continue
    ns = p.nextSibling
    # skip a bare string (e.g. whitespace) sitting between the two cells
    if ns and not ns.name: ns = ns.nextSibling
    if not ns or ns.name not in ('td', 'th'): continue
    print(ns.string)
I've added ns.string, which gives the next sibling's contents if and only if they're just text (no further nested tags) -- of course you can instead analyze further at this point, depending on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.
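To try the loop end to end, here is a self-contained sketch under a few assumptions: it uses the bs4 package with html.parser, the current method names (find_all, next_sibling, string=), and a hypothetical helper called values_after_label that collects results instead of printing them. It also deviates slightly from the loop above by skipping any number of bare strings between the cells and by using get_text() rather than .string.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The question's markup, with the elided attributes left out.
html = """<td>
<table>
<tr>
<th>price</th>
<th>$99.99</th>
</tr>
</table>
</td>"""


def values_after_label(root, label):
    """Return the text of the td/th cell following each th whose text is `label`.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    found = []
    for t in root.find_all(string=label):
        p = t.parent
        if p.name != "th":
            continue
        ns = p.next_sibling
        # Skip whitespace or other bare strings between the two cells.
        while ns is not None and not isinstance(ns, Tag):
            ns = ns.next_sibling
        if ns is None or ns.name not in ("td", "th"):
            continue
        found.append(ns.get_text(strip=True))
    return found


x = BeautifulSoup(html, "html.parser")
prices = values_after_label(x, "price")
# Strip the currency sign to get a number, as the question asks for 99.99.
amounts = [float(p.lstrip("$")) for p in prices]
print(prices, amounts)
```

Returning a list keeps the "something smarter than print" decision with the caller, who can convert, validate or store the values as needed.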
Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
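For comparison, and not as the approach of this answer, current BeautifulSoup releases can express the same lookup more directly with find and find_next_sibling. This is only a sketch assuming the bs4 package, html.parser, and a header cell that contains nothing but the label text.

```python
from bs4 import BeautifulSoup

# Reduced version of the question's markup.
html = "<table><tr><th>price</th><th>$99.99</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Locate the header cell whose only content is the string 'price'.
label = soup.find("th", string="price")

# Take the next sibling that is a td or th; strings in between are skipped.
value_cell = label.find_next_sibling(["td", "th"]) if label else None
value = value_cell.get_text(strip=True) if value_cell else None
print(value)
```

Both guards return None rather than raising when the label or the following cell is missing, which keeps the calling code flat in the same spirit as the continue statements above.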