Python BeautifulSoup extract text between element

Tags: python, beautifulsoup

I am trying to extract "THIS IS MY TEXT" from the following HTML:

    <html>
    <body>
    <table>
       <td class="MYCLASS">
          <!-- a comment -->
          <a hef="xy">Text</a>
          <p>something</p>
          THIS IS MY TEXT
          <p>something else</p>
          </br>
       </td>
    </table>
    </body>
    </html>

I tried it this way:

    soup = BeautifulSoup(html)

    for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
        print hit.text

But this gives me the text of all the nested tags, plus the comment.

Can anyone help me to just get "THIS IS MY TEXT" out of this?

asked May 30 '13 11:05

ɥɔǝnq ɹǝƃloɥ

2 Answers

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made up of Tag objects and NavigableString objects (such as THIS IS MY TEXT in your HTML). An example:

    from BeautifulSoup import BeautifulSoup

    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    soup = BeautifulSoup(''.join(doc))

    print soup.prettify()
    # <html>
    #  <head>
    #   <title>
    #    Page title
    #   </title>
    #  </head>
    #  <body>
    #   <p id="firstpara" align="center">
    #    This is paragraph
    #    <b>
    #     one
    #    </b>
    #    .
    #   </p>
    #   <p id="secondpara" align="blah">
    #    This is paragraph
    #    <b>
    #     two
    #    </b>
    #    .
    #   </p>
    #  </body>
    # </html>

To move down the parse tree, use contents and string. Quoting the documentation:

contents is an ordered list of the Tag and NavigableString objects contained within a page element
if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the example above, this means you can write:

    soup.b.string
    # u'one'
    soup.b.contents[0]
    # u'one'

When a tag has several child nodes, you get, for instance:

    pTag = soup.p
    pTag.contents
    # [u'This is paragraph ', <b>one</b>, u'.']

So here you can index into contents to pick out the child you want.

You can also iterate over a Tag directly, which is a shortcut for iterating over its contents. For instance:

    for i in soup.body:
        print i
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
    # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
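For readers on Python 3, the same navigation can be sketched with the bs4 package. This is only an illustration under some assumptions: it uses the built-in "html.parser" backend, the sample markup has been given explicit closing tags so the tree does not depend on parser repairs, and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with closing tags written out explicitly.
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

# The first <b> tag has a single string child, so .string and .contents[0] agree.
first_b_string = soup.b.string
first_b_child = soup.b.contents[0]

# The first <p> tag has three children: a string, a <b> tag, and another string.
p_contents = soup.p.contents

# Iterating over a tag walks its direct children, here the two <p> tags.
body_children = [child.name for child in soup.body]

print(first_b_string, p_contents, body_children)
```

Indexing into contents works, but it depends on the exact position of each child, so it tends to break when the markup changes.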

answered Sep 18 '22 05:09

kiriloff

Use .children instead:

    from bs4 import NavigableString, Comment
    print ''.join(unicode(child) for child in hit.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

    >>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    ...     print ''.join(unicode(child) for child in hit.children
    ...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
    ...

          THIS IS MY TEXT
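The snippet above is Python 2. A rough Python 3 equivalent might look like the sketch below. It assumes bs4 with the built-in "html.parser" backend; the helper name direct_text and the results list are hypothetical names introduced here, and the surrounding whitespace is stripped for convenience.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# The HTML from the question, kept as posted (including the stray </br>).
html = """
<html> <body> <table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table> </body> </html>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    return "".join(
        str(child)
        for child in tag.children
        # Comment is a subclass of NavigableString, so exclude it explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )


# Text inside nested tags (<a>, <p>) is not a direct child, so it is left out.
results = [direct_text(hit).strip() for hit in soup.find_all(class_="MYCLASS")]
print(results)
```

The explicit Comment check matters because comments are stored as a kind of string node, so a plain NavigableString test alone would let "a comment" through.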

answered Sep 21 '22 05:09

Martijn Pieters
