
In BeautifulSoup, Ignore Children Elements While Getting Parent Element Data

I have HTML as follows:

<html>
    <div class="maindiv">
        text data here 
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>

I am trying to get only the text found in the maindiv element, without getting the text data found in the somename element. In my experience, most text data is contained within some child element. However, I have run into this particular case, where the data seems to be placed somewhat willy-nilly and is a bit harder to filter.

My approach is as follows:

textdata = soup.find('div', class_='maindiv').get_text()

This gets all the text data found within the maindiv element, as well as the text data found in the somename div element.

The logic I'd like to use is more along the lines of: textdata = soup.find('div', class_='maindiv').get_text(recursive=False) which would omit any text data found within the somename element.

I know the recursive=False argument works for locating only parent-level elements when searching the DOM structure using BeautifulSoup, but it can't be used with the .get_text() method.

I've considered the approach of finding all the text, then subtracting the string data found in the somename element from the string data found in the maindiv element, but I'm looking for something a little more efficient.

alphazwest asked Nov 17 '16 16:11


People also ask

How do you get a child element in BeautifulSoup?

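To get the child tags of an element in Beautiful Soup, use the find_all() method with recursive=False. The .children and .contents attributes return all direct child nodes, including text.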

How do you find the parent element in BeautifulSoup?

Pass the HTML document into the BeautifulSoup() constructor. Get the tag or element in the document. Use its .parent attribute to find the parent of any tag.

How do you get elements text in BeautifulSoup?

To extract text that is directly under an element in Beautiful Soup use the find_all(text=True, recursive=False) method. Here, note the following: The text=True means to look for text instead of elements. The recursive=False means to only search directly under the element.

How do you exclude tags in BeautifulSoup?

You can use extract() to remove unwanted tags before you get the text. This keeps all '\n' characters and spaces, so you will need some extra work to remove them. Alternatively, you can skip every Tag object inside the outer element and keep only the NavigableString objects (the plain text in the HTML).
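As a rough illustration of the direct-text lookup described above, here is a minimal sketch applied to the HTML from the question. It assumes the built-in 'html.parser' backend and uses string=True, the newer spelling of text=True; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="maindiv">
    text data here
    <br>
    continued text data
    <br>
    <div class="somename">
        text & data I want to omit
    </div>
</div>
"""

# Assumption: the standard-library parser is good enough for this snippet.
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# string=True matches text nodes instead of tags;
# recursive=False restricts the search to direct children of maindiv.
direct_strings = main.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and trim the indentation around the rest.
pieces = [s.strip() for s in direct_strings if s.strip()]
textdata = " ".join(pieces)
print(textdata)
```

Because the search never descends into the somename div, its text is not collected and the parsed tree is left unchanged.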


2 Answers

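This is not far from your subtraction approach, but one way to do it (at least in Python 3) is to discard all child divs before calling get_text(). Note that decompose() permanently removes them from the parsed tree.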

s = soup.find('div', class_='maindiv')

for child in s.find_all("div"):
    child.decompose()

print(s.get_text())

Would print something like:

text data here

        continued text data

That might be a bit more efficient and flexible than subtracting the strings, though it still needs to go through the children first.
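If the somename div is still needed later, one possible variation is to run the same removal on a copy of the element. The following is only a sketch: it assumes the 'html.parser' backend, relies on copy.copy() producing a detached copy of the tag, and collapses whitespace with split() and join(), which is a choice made here rather than part of the answer above.

```python
import copy

from bs4 import BeautifulSoup

html = """
<div class="maindiv">
    text data here
    <br>
    continued text data
    <br>
    <div class="somename">
        text & data I want to omit
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# Copy the element so the original soup keeps its child div.
main_copy = copy.copy(main)

# Remove every nested div from the copy only.
for child in main_copy.find_all("div"):
    child.decompose()

# get_text() keeps newlines and indentation; split/join collapses them.
textdata = " ".join(main_copy.get_text().split())
print(textdata)
```

The copy costs some extra memory, so for a one-off scrape decomposing in place is simpler.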

Teemu Risikko answered Sep 25 '22 03:09


from bs4 import BeautifulSoup

html = '''
<html>
    <div class="maindiv">
        text data here 
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>'''
soup = BeautifulSoup(html, 'lxml')

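# next_element is the first node parsed after the opening <div> tag,
# here the first text node only; "continued text data" is not included.
soup.find('div', class_="maindiv").next_element

Output: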

'\n        text data here \n        '
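Since next_element returns only the first text node, the second line of text is missed. One way to collect every direct text node is to walk .children and keep the NavigableString objects. This is a sketch under the assumption that 'html.parser' is used instead of 'lxml' (to avoid the extra dependency); direct_text is an illustrative name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<html>
    <div class="maindiv">
        text data here
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# .children yields only direct children; Tag objects (br, div) are skipped,
# and whitespace-only strings are filtered out.
direct_text = [
    node.strip()
    for node in main.children
    if isinstance(node, NavigableString) and node.strip()
]
print(direct_text)
```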
宏杰李 answered Sep 27 '22 03:09