
BeautifulSoup: find th with text 'price', then get price from next th

My html looks like:

<td>
   <table ..>
      <tr>
         <th ..>price</th>
         <th>$99.99</th>
      </tr>
   </table>
</td>

So I am in the current table cell, how would I get the 99.99 value?

I have so far:

td[3].findChild('th')

But I need to do:

Find th with text 'price', then get next th tag's string value.

Asked by Blankman, Jul 31 '10 04:07




1 Answer

Think about it in "steps"... given that some x is the root of the subtree you're considering,

x.findAll(text='price')

is the list of all text nodes in that subtree whose string is exactly 'price'. The parents of those items then of course will be:

[t.parent for t in x.findAll(text='price')]

and if you only want to keep those whose "name" (tag) is 'th', then of course

[t.parent for t in x.findAll(text='price') if t.parent.name=='th']

and you want the "next siblings" of those (but only if they're also 'th's), so

[t.parent.nextSibling for t in x.findAll(text='price')
 if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']
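
To watch the steps above produce values, here is a minimal sketch run against markup modeled on the question. It assumes the bs4 package, where find_all(string=...) and next_sibling are the current spellings of findAll(text=...) and nextSibling. The sample markup is deliberately written without whitespace between the two cells (an assumption), so the second th is the immediate next sibling.

```python
from bs4 import BeautifulSoup

# Markup modeled on the question; no whitespace between the cells (assumption),
# so the value cell directly follows the label cell.
html = "<td><table><tr><th>price</th><th>$99.99</th></tr></table></td>"
x = BeautifulSoup(html, "html.parser")

# Step 1: text nodes whose string is exactly 'price'.
texts = x.find_all(string="price")

# Step 2: the tags that contain those text nodes.
parents = [t.parent for t in texts]

# Step 3: keep only the parents that are <th> tags.
th_parents = [p for p in parents if p.name == "th"]

# Step 4: the next siblings, kept only when they are <th> tags too.
next_cells = [
    t.parent.next_sibling
    for t in x.find_all(string="price")
    if t.parent.name == "th"
    and t.parent.next_sibling
    and t.parent.next_sibling.name == "th"
]

prices = [cell.string for cell in next_cells]
print(prices)
```

Note how often t.parent and t.parent.next_sibling recur in the last comprehension.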

Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:

Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.

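for t in x.findAll(text='price'):
  p = t.parent
  # only a 'price' label that sits directly in a th is of interest
  if p.name != 'th': continue
  ns = p.nextSibling
  # step over a bare string of text between the two cells
  if ns and not ns.name: ns = ns.nextSibling
  # the value cell may be a td or a th
  if not ns or ns.name not in ('td', 'th'): continue
  print(ns.string)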

I've added ns.string, which gives the next sibling's contents if and only if they're just text (no further nested tags). Of course you can instead analyze further at this point, depending on your application's needs. Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.
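
As one possible "something smarter", the loop can be wrapped in a function that returns the values instead of printing them. This is only a sketch, assuming the bs4 package; values_after_label is a hypothetical helper name, and it swaps ns.string for get_text(strip=True) so that a value cell with nested tags still yields its text.

```python
from bs4 import BeautifulSoup, Tag


def values_after_label(root, label="price"):
    """Hypothetical helper: text of the cell that follows each <th>label</th>."""
    values = []
    for t in root.find_all(string=label):
        p = t.parent
        if p.name != "th":
            continue
        ns = p.next_sibling
        # Step over one text node (typically whitespace) between the cells.
        if ns is not None and not isinstance(ns, Tag):
            ns = ns.next_sibling
        if not isinstance(ns, Tag) or ns.name not in ("td", "th"):
            continue
        # get_text also handles nested tags, unlike .string (a deliberate swap).
        values.append(ns.get_text(strip=True))
    return values


# Markup from the question, with the elided attributes left out.
html = """
<td>
   <table>
      <tr>
         <th>price</th>
         <th>$99.99</th>
      </tr>
   </table>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
print(values_after_label(soup))
```

Returning a list keeps the extraction separate from whatever is done with the prices afterwards.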

Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).

Answered by Alex Martelli, Sep 21 '22 08:09