I try to extract "THIS IS MY TEXT" from the following HTML:
<html> <body> <table> <td class="MYCLASS"> <!-- a comment --> <a hef="xy">Text</a> <p>something</p> THIS IS MY TEXT <p>something else</p> </br> </td> </table> </body> </html>
I tried it this way:
soup = BeautifulSoup(html) for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): print hit.text
But I get all the text between all nested Tags plus the comment.
Can anyone help me to just get "THIS IS MY TEXT" out of this?
find() method The find method is used for finding out the first tag with the specified name or id and returning an object of type bs4. Example: For instance, consider this simple HTML webpage having different paragraph tags.
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string: Python3.
Learn more about how to navigate through the parse tree in BeautifulSoup
. Parse tree has got tags
and NavigableStrings
(as THIS IS A TEXT). An example
from BeautifulSoup import BeautifulSoup doc = ['<html><head><title>Page title</title></head>', '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', '<p id="secondpara" align="blah">This is paragraph <b>two</b>.', '</html>'] soup = BeautifulSoup(''.join(doc)) print soup.prettify() # <html> # <head> # <title> # Page title # </title> # </head> # <body> # <p id="firstpara" align="center"> # This is paragraph # <b> # one # </b> # . # </p> # <p id="secondpara" align="blah"> # This is paragraph # <b> # two # </b> # . # </p> # </body> # </html>
To move down the parse tree you have contents
and string
.
contents is an ordered list of the Tag and NavigableString objects contained within a page element
if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]
For the above, that is to say you can get
soup.b.string # u'one' soup.b.contents[0] # u'one'
For several children nodes, you can have for instance
pTag = soup.p pTag.contents # [u'This is paragraph ', <b>one</b>, u'.']
so here you may play with contents
and get contents at the index you want.
You also can iterate over a Tag, this is a shortcut. For instance,
for i in soup.body: print i # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p> # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
Use .children
instead:
from bs4 import NavigableString, Comment print ''.join(unicode(child) for child in hit.children if isinstance(child, NavigableString) and not isinstance(child, Comment))
Yes, this is a bit of a dance.
Output:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}): ... print ''.join(unicode(child) for child in hit.children ... if isinstance(child, NavigableString) and not isinstance(child, Comment)) ... THIS IS MY TEXT
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With