
Get text of HTML tags without text of inner child tags

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it's just:

<div id="1">
    this is the text i want here
</div>

I want only the text that belongs directly to the outer tag, ignoring the text of any child tags. If I use the .text property, I get both.

asked May 11 '15 02:05

User


1 Answer

Another possible approach (I would wrap it in a function):

def getText(parent):
    # Join only the text nodes that are direct children of parent.
    return ''.join(parent.find_all(text=True, recursive=False)).strip()

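recursive=False restricts the search to direct children of parent, skipping anything nested deeper. text=True restricts it to text nodes, so child tags themselves are not returned. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.)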
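
To see what each argument contributes, here is a minimal sketch that runs the same search with and without recursive=False. The one-line HTML and the variable names are made up for illustration, and it uses string=True, the newer spelling of text=True, with the parser named explicitly:

```python
from bs4 import BeautifulSoup

# Hypothetical, compact version of the question's markup.
html = '<div id="1"><div id="2">inner</div>outer</div>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.div  # first <div> in the document, i.e. id="1"

# Default (recursive=True): text nodes at any depth below the outer div.
all_text = [str(s) for s in outer.find_all(string=True)]

# recursive=False: only text nodes that are direct children of the outer div.
direct_text = [str(s) for s in outer.find_all(string=True, recursive=False)]

print(all_text)     # ['inner', 'outer']
print(direct_text)  # ['outer']
```

Only the second call drops the nested div's text, which is why getText passes both arguments.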

Usage example:

from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(getText(soup.div))
# this is the text i want here
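
The question also mentions a second shape, where the outer div has no child div at all. As a rough check that direct-children filtering covers both, here is a sketch of an equivalent variant that walks .children and keeps only NavigableString nodes. The name direct_text is hypothetical and this is not the code from the answer above, just another way to express the same idea:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def direct_text(parent):
    # Keep only direct children that are text nodes; child tags are skipped.
    # Note: Comment is a NavigableString subclass, so HTML comments that are
    # direct children would be included as well.
    parts = [str(c) for c in parent.children if isinstance(c, NavigableString)]
    return "".join(parts).strip()

nested = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>"""

flat = """<div id="1">
    this is the text i want here
</div>"""

for html in (nested, flat):
    soup = BeautifulSoup(html, "html.parser")
    print(direct_text(soup.div))
# both iterations print: this is the text i want here
```

The final strip() matters here because the whitespace between tags is itself a text node and would otherwise end up at the start of the result.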
answered Oct 06 '22 01:10

har07