How to ignore empty lines while using .next_sibling in BeautifulSoup4 in python

I want to remove duplicated placeholders from an HTML page, so I use the .next_sibling attribute of BeautifulSoup. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore it (see data2).

Here is the code:

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)
string = 'method-removed-here'
for p in soup.find_all("p"):
    while isinstance(p.next_sibling, Tag) and p.next_sibling.name== 'p' and p.text==string:
        p.next_sibling.decompose()
print(soup)

Output for data is as expected:

<html><head></head><body><p>method-removed-here</p></body></html>

Output for data2 (this needs to be fixed):

<html><head></head><body><p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
</body></html>

I couldn't find anything useful about this in the BeautifulSoup4 documentation, and .next_element is not what I am looking for either.

asked Apr 23 '14 10:04

svenwildermann-msft

1 Answer

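I was able to solve this with a workaround. The whitespace between two tags is parsed as a text node of its own, so .next_sibling returns that string instead of the next tag. The problem is described in the BeautifulSoup Google group, where the suggestion is to run the HTML through a preprocessor before parsing it: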

 import re

 def bs_preprocess(html):
     """Remove distracting whitespace and newline characters."""
     pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
     html = re.sub(pat, '', html)        # remove leading and trailing whitespace on each line
     html = re.sub(r'\n', ' ', html)     # convert newlines to spaces
                                         # this preserves newline delimiters
     html = re.sub(r'\s+<', '<', html)   # remove whitespace before tags
     html = re.sub(r'>\s+', '>', html)   # remove whitespace after tags
     return html

It is not the best solution, but it works.
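If you would rather not rewrite the HTML before parsing, another option is to skip the whitespace-only text nodes while walking the siblings. The following is only a minimal sketch of that idea, not something taken from the BeautifulSoup documentation: the helper names next_non_blank_sibling and remove_adjacent_duplicates are made up for this example, and html.parser is assumed as the parser. Unlike the loop in the question, it also checks that the sibling has the same text before removing it.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def next_non_blank_sibling(node):
    """Return the next sibling that is not a whitespace-only string, or None."""
    sibling = node.next_sibling
    # Text between tags (such as "\n\n") is a NavigableString; skip it if blank.
    while isinstance(sibling, NavigableString) and not sibling.strip():
        sibling = sibling.next_sibling
    return sibling


def remove_adjacent_duplicates(soup, text):
    """Remove <p> tags that repeat `text` directly after a <p> with the same text."""
    p = soup.find("p")
    while p is not None:
        if p.get_text() == text:
            sibling = next_non_blank_sibling(p)
            while (isinstance(sibling, Tag) and sibling.name == "p"
                   and sibling.get_text() == text):
                sibling.decompose()
                # Look again from p, because the removed tag is gone from the tree.
                sibling = next_non_blank_sibling(p)
        # Move on to the next <p> that is still part of the document.
        p = p.find_next("p")


data2 = "<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n"
soup = BeautifulSoup(data2, "html.parser")
remove_adjacent_duplicates(soup, "method-removed-here")
print(soup)
```

The blank lines themselves stay in the output, since only the duplicate tags are removed. A different tag or non-blank text between two placeholders stops the removal, which matches the "directly adjacent" behaviour of the original loop.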

answered Oct 13 '22 20:10

svenwildermann-msft

Tags: python, html-parsing, beautifulsoup
