
Extracting text node inside a tag that has a child element in beautifulsoup4

The HTML that I am parsing and scraping has the following code:

<li> <span> 929</span> Serve Returned </li>

How can I extract just the text node of <li> ("Serve Returned" in this case) with BeautifulSoup?

.string doesn't work because <li> has a child element, and .text also returns the text inside <span>.
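A minimal sketch of the behaviour described above, assuming Python 3 with bs4 installed and the standard-library "html.parser" backend (the original post does not name a parser):

```python
from bs4 import BeautifulSoup

html = "<li> <span> 929</span> Serve Returned </li>"
soup = BeautifulSoup(html, "html.parser")
li = soup.li

# .string is None because <li> contains more than one child node
string_value = li.string
# .text concatenates every descendant string, including the one inside <span>
text_value = li.text

print(repr(string_value))
print(repr(text_value))
```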

user3562812 asked Apr 22 '15 20:04



2 Answers

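from bs4 import BeautifulSoup

html = "<li> <span> 929</span> Serve Returned </li>"
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
# string=True matches text nodes; recursive=False limits the search to direct children of <li>
print(soup.li.find_all(string=True, recursive=False))

This gives:

[' ', ' Serve Returned ']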

The first element is the whitespace text node that precedes the <span>. This method finds the text before, after, and in between any child elements.
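If a single clean string is wanted rather than a list, the direct text nodes can be stripped and joined. The following is only a sketch: the helper name own_text is hypothetical, and it assumes Python 3 and a bs4 version that accepts the string= argument to find_all.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return only the text directly inside `tag`, skipping child elements."""
    # recursive=False keeps the search to direct children of the tag
    pieces = tag.find_all(string=True, recursive=False)
    # drop whitespace-only nodes and join whatever is left
    return " ".join(piece.strip() for piece in pieces if piece.strip())


html = "<li> <span> 929</span> Serve Returned </li>"
soup = BeautifulSoup(html, "html.parser")
result = own_text(soup.li)
print(result)  # Serve Returned
```

Joining with a single space is a design choice for this example; text that surrounds several child elements might need a different separator.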

Hooked answered Oct 24 '22 10:10



I used the str.replace method for this:

>>> li = soup.find('li')  # or however you need to drill down to the <li> tag
>>> mytext = li.text.replace(li.find('span').text, "")
>>> print(mytext.strip())  # strip() drops the whitespace left around the text
Serve Returned
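One thing to watch with this approach: str.replace removes every occurrence of the span's text, not just the one that came from the <span>. The sketch below uses a made-up input (not from the question) to show the effect, and limits the replacement with the count argument, which only helps when the <span> text is the first occurrence in the string.

```python
from bs4 import BeautifulSoup

# Hypothetical input where the span's text (" 1") also appears later in the <li>
html = "<li> <span> 1</span> Top 10 </li>"
li = BeautifulSoup(html, "html.parser").li
span_text = li.find("span").text

# Replacing every occurrence also eats the " 1" inside " 10"
naive = li.text.replace(span_text, "").strip()
# count=1 removes only the first occurrence, which here is the span's text
limited = li.text.replace(span_text, "", 1).strip()

print(naive)    # Top0
print(limited)  # Top 10
```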
Totem answered Oct 24 '22 11:10
