In BeautifulSoup, Ignore Children Elements While Getting Parent Element Data

Tags:

python

html

beautifulsoup

I have HTML as follows:

<html>
    <div class="maindiv">
        text data here 
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>

I am trying to get only the text found in the maindiv element, without the text found in the somename element. In my experience, most text data is contained within some child element. In this case, however, the text sits directly inside the parent, somewhat willy-nilly, which makes it harder to filter.

My approach is as follows:

textdata = soup.find('div', class_='maindiv').get_text()

This gets all the text data found within the maindiv element, as well as the text data found in the somename div element.

The logic I'd like to use is more along the lines of: textdata = soup.find('div', class_='maindiv').get_text(recursive=False) which would omit any text data found within the somename element.

I know the recursive=False argument restricts a search to direct children when searching the DOM structure with BeautifulSoup, but it can't be used with the .get_text() method.

I'm aware of the approach of getting all the text and then subtracting the somename element's text from the maindiv element's text, but I'm looking for something a little more efficient.


asked Nov 17 '16 16:11

alphazwest

2 Answers

This is not far from your subtraction method, but one way to do it (at least in Python 3) is to remove all the nested divs from the tree before extracting the text.

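s = soup.find('div', class_='maindiv')

# Remove every nested div (and its text) from the parsed tree
for child in s.find_all("div"):
    child.decompose()

# Only the text outside the removed divs remains
print(s.get_text())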

This would print something like:

text data here

        continued text data

That might be a bit more efficient and flexible than subtracting the strings, though it still has to visit the child divs first, and decompose() permanently removes them from the parsed tree.
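If the rest of the script still needs the somename div, the same idea can be applied to a copy of the element instead. The following is a minimal sketch, not part of the original answer. It assumes a bs4 version where copy.copy() works on a Tag, uses the built-in html.parser, and the names main, main_copy and own_text are made up for illustration.

```python
import copy
from bs4 import BeautifulSoup

html = """
<html>
    <div class="maindiv">
        text data here
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>"""

# html.parser is used here only to avoid an extra dependency (assumption)
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# Work on a detached copy so the original tree keeps its child divs
main_copy = copy.copy(main)
for child in main_copy.find_all("div"):
    child.decompose()

# strip=True trims each text fragment and drops empty ones;
# the " " separator joins what is left
own_text = main_copy.get_text(" ", strip=True)
print(own_text)  # text data here continued text data
```

The original soup is left untouched, at the cost of copying the element first.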

answered Sep 25 '22 03:09

Teemu Risikko

from bs4 import BeautifulSoup
html = '''
<html>
    <div class="maindiv">
        text data here 
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>'''
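soup = BeautifulSoup(html, 'lxml')

# next_element is the node parsed right after the opening <div> tag,
# which here is the div's first text node only
soup.find('div', class_="maindiv").next_element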

out:

'\n        text data here \n        '
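Note that next_element returns only the first text node, so "continued text data" is not included in the output above. One way to collect every text node that sits directly inside maindiv is to combine string=True with recursive=False in find_all(). This is a minimal sketch, not part of the original answer. It assumes a bs4 version that accepts the string argument (older releases call it text), uses html.parser, and the names direct_strings and parts are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <div class="maindiv">
        text data here
        <br>
        continued text data
        <br>
        <div class="somename">
            text & data I want to omit
        </div>
    </div>
</html>"""

# html.parser is used here only to avoid an extra dependency (assumption)
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# string=True matches text nodes; recursive=False limits the search
# to direct children, so text inside the nested div is never visited
direct_strings = main.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and trim the rest
parts = [s.strip() for s in direct_strings if s.strip()]
textdata = " ".join(parts)
print(textdata)  # text data here continued text data
```

This leaves the tree unmodified and skips nested elements of any kind, not just divs.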

answered Sep 27 '22 03:09

宏杰李
