
Only extracting text from this element, not its children

I want to extract only the text from the top-most element of my soup; however, soup.text gives the text of all the child elements as well.

I have:

    import BeautifulSoup
    soup = BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
    print soup.text

The output of this is yesno. I want simply yes.

What's the best way of achieving this?

Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.

Dragon asked Feb 14 '11 17:02




2 Answers

What about .find(text=True)?

    >>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
    u'yes'
    >>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
    u'no'

EDIT:

I think that I've understood what you want now. Try this:

    >>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
    u'yes'
    >>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
    u'yes'
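For readers on Python 3, the same idea might look like the sketch below. It assumes the bs4 package is installed and uses the built-in html.parser; string=True is bs4's newer spelling of text=True, and first_direct_text is a hypothetical helper name chosen for illustration.

```python
from bs4 import BeautifulSoup


def first_direct_text(markup):
    """Return the first text node that is a direct child of <html>, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # string=True matches text nodes only; recursive=False restricts the
    # search to direct children of <html>, so text inside <b> is skipped.
    return soup.html.find(string=True, recursive=False)


print(first_direct_text("<html>yes<b>no</b></html>"))
print(first_direct_text("<html><b>no</b>yes</html>"))
```

Both calls should print yes, since the text inside the b element is never a direct child of html.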
jbochi answered Sep 21 '22 04:09


You could use contents:

    >>> print soup.html.contents[0]
    yes

Or, to get all the text nodes directly under html, use findAll(text=True, recursive=False):

    >>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
    >>> soup.html.findAll(text=True, recursive=False)
    [u'x', u'yes']

The list above can be joined to form a single string:

    >>> ''.join(soup.html.findAll(text=True, recursive=False))
    u'xyes'
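A rough Python 3 equivalent of the join above, assuming the bs4 package and the built-in html.parser. find_all is the bs4 name for findAll, and direct_text is a hypothetical helper name used only for this sketch.

```python
from bs4 import BeautifulSoup


def direct_text(markup):
    """Concatenate the text nodes that are direct children of <html>."""
    soup = BeautifulSoup(markup, "html.parser")
    # Collect only the top-level text nodes of <html>; text nested in
    # child tags such as <b> is left out because recursive=False.
    pieces = soup.html.find_all(string=True, recursive=False)
    return "".join(pieces)


print(direct_text("<html>x<b>no</b>yes</html>"))
```

For the markup shown this should print xyes, matching the interpreter session above.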
TigrisC answered Sep 21 '22 04:09