I want to extract only the text from the top-most element of my soup; however, soup.text gives the text of all the child elements as well.
I have:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text
The output of this is yesno. I want simply 'yes'.
What's the best way of achieving this?
Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.
To extract paragraphs, search the BeautifulSoup object for 'p' tags. To get the text of an HTML document, call get_text().
Remove tags with extract(): BeautifulSoup has a built-in method called extract() that removes a tag or string from the tree. Once you've located the element you want to get rid of, say i_tag, calling i_tag.extract() removes the element from the tree and returns it at the same time.
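A minimal sketch of extract() in action. It assumes the newer bs4 package (BeautifulSoup 4) on Python 3 with the standard-library "html.parser" backend, not the BeautifulSoup 3 module used in the question; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: BeautifulSoup 4 (bs4) with the stdlib "html.parser" backend.
soup = BeautifulSoup("<p>keep <i>drop</i> this</p>", "html.parser")

i_tag = soup.i              # locate the element to get rid of
removed = i_tag.extract()   # detaches <i> from the tree and returns it

remaining_markup = str(soup)    # the tree no longer contains <i>
removed_markup = str(removed)   # the extracted tag is still usable on its own

print(remaining_markup)
print(removed_markup)
```

Because extract() hands back the removed element, you can still inspect it or re-insert it somewhere else afterwards.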
To get href values with Python BeautifulSoup, use the find_all method. First create a soup object by calling the BeautifulSoup class with the HTML string. Then find the a elements that have an href attribute by calling find_all with 'a' and href set to True.
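The href lookup described above could look roughly like this. Again this assumes bs4 (BeautifulSoup 4) with "html.parser"; the sample HTML is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two real links and one anchor without href.
html = '<a href="/one">1</a><a name="anchor">no link</a><a href="/two">2</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute,
# so indexing a["href"] cannot raise KeyError here.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(links)
```

Filtering with href=True is what lets the list comprehension skip the anchor that has no link target.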
What about .find(text=True)?
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'
EDIT:
I think that I've understood what you want now. Try this:
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'
You could use contents:
>>> print soup.html.contents[0]
yes
Or, to get all the texts directly under html, use findAll(text=True, recursive=False):
>>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
>>> soup.html.findAll(text=True, recursive=False)
[u'x', u'yes']
The same result, joined to form a single string:
>>> ''.join(soup.html.findAll(text=True, recursive=False))
u'xyes'
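For anyone on Python 3, here is a rough equivalent of the approach above using bs4 (BeautifulSoup 4). This is a sketch under a few assumptions: the "html.parser" backend is used (other parsers such as lxml may wrap the markup in extra tags and change what counts as a direct child), string=True is the bs4 spelling of the older text=True, and direct_text is a helper name invented here.

```python
from bs4 import BeautifulSoup


def direct_text(markup):
    """Return only the text that sits directly inside <html>, skipping children.

    direct_text is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # recursive=False restricts the search to direct children of <html>;
    # string=True keeps only text nodes, so text inside <b> is ignored.
    pieces = soup.html.find_all(string=True, recursive=False)
    return "".join(pieces)


print(direct_text("<html>yes<b>no</b></html>"))
print(direct_text("<html><b>no</b>yes</html>"))
print(direct_text("<html>x<b>no</b>yes</html>"))
```

Joining the pieces covers both orderings from the question, since it does not matter whether the text comes before or after the child tag.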