
Python BeautifulSoup: extract text between elements

I am trying to extract "THIS IS MY TEXT" from the following HTML:

    <html>
    <body>
    <table>
       <td class="MYCLASS">
          <!-- a comment -->
          <a hef="xy">Text</a>
          <p>something</p>
          THIS IS MY TEXT
          <p>something else</p>
          </br>
       </td>
    </table>
    </body>
    </html>

I tried it this way:

    soup = BeautifulSoup(html)

    for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
        print hit.text

But this gives me the text of all the nested tags, plus the comment.

How can I get only "THIS IS MY TEXT" out of this?

ɥɔǝnq ɹǝƃloɥ asked May 30 '13 11:05




2 Answers

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree consists of Tags and NavigableStrings (such as THIS IS MY TEXT). An example:

    from BeautifulSoup import BeautifulSoup

    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    soup = BeautifulSoup(''.join(doc))

    print soup.prettify()
    # <html>
    #  <head>
    #   <title>
    #    Page title
    #   </title>
    #  </head>
    #  <body>
    #   <p id="firstpara" align="center">
    #    This is paragraph
    #    <b>
    #     one
    #    </b>
    #    .
    #   </p>
    #   <p id="secondpara" align="blah">
    #    This is paragraph
    #    <b>
    #     two
    #    </b>
    #    .
    #   </p>
    #  </body>
    # </html>

To move down the parse tree you have contents and string.

  • contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the example above, this means you can get:

    soup.b.string
    # u'one'
    soup.b.contents[0]
    # u'one'

When a tag has several child nodes, you get, for instance:

    pTag = soup.p
    pTag.contents
    # [u'This is paragraph ', <b>one</b>, u'.']

So here you can index into contents to pick out the child you want.
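As a rough illustration of the difference between string and contents, here is a minimal sketch. It assumes Python 3 with the bs4 package and the built-in html.parser, whereas the code above uses the older BeautifulSoup 3 import. The names b_tag and p_tag are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="firstpara">This is paragraph <b>one</b>.</p>', "html.parser"
)

b_tag = soup.b
p_tag = soup.p

# <b> has exactly one child, and that child is a string, so .string works.
print(b_tag.string)         # one

# <p> has three children (string, tag, string), so .string is None
# and you have to index into .contents instead.
print(p_tag.string)         # None
print(len(p_tag.contents))  # 3
print(p_tag.contents[0])    # the leading text, "This is paragraph "
print(p_tag.contents[2])    # .
```

The point is that string only helps for tags with a single string child; for mixed content such as the td in the question, contents is the place to look.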

You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:

    for i in soup.body:
        print i
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
    # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
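Applied to the markup from the question, iterating over the cell shows which child is the bare text. This is only a sketch under some assumptions: Python 3 with bs4 and html.parser, a compacted copy of the question's td with whitespace and the stray </br> removed, and the illustrative names cell and kinds.

```python
from bs4 import BeautifulSoup

# Compacted copy of the <td> from the question (whitespace and </br> dropped).
html = (
    '<td class="MYCLASS"><!-- a comment --><a hef="xy">Text</a>'
    '<p>something</p>THIS IS MY TEXT<p>something else</p></td>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find(class_="MYCLASS")

# Iterating over a Tag yields its direct children, same as cell.contents.
kinds = [type(child).__name__ for child in cell]
print(kinds)
# ['Comment', 'Tag', 'Tag', 'NavigableString', 'Tag']
```

Only one direct child is a plain NavigableString, and that is the text being asked for. Note that the comment is also a child of the cell, so it has to be filtered out separately.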
kiriloff answered Sep 18 '22 05:09


Use .children instead:

    from bs4 import NavigableString, Comment
    print ''.join(unicode(child) for child in hit.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

    >>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    ...     print ''.join(unicode(child) for child in hit.children
    ...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
    ...
          THIS IS MY TEXT
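The snippet above is Python 2. For Python 3, the same idea might look like the sketch below. It assumes bs4 with the built-in html.parser and a trimmed copy of the question's markup (the stray </br> is left out); the helper name direct_text is made up for this example and is not part of bs4.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the markup from the question.
html = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    parts = [
        str(child)
        for child in tag.children
        # Comment is a subclass of NavigableString, so exclude it explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    # The whitespace between the child tags is part of the strings; strip it.
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
results = [direct_text(hit) for hit in soup.find_all(class_="MYCLASS")]
print(results)
# ['THIS IS MY TEXT']
```

The strip() call is an addition to the original approach; without it the result keeps the surrounding newlines and indentation, as in the output shown above.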
Martijn Pieters answered Sep 21 '22 05:09