As i want to remove duplicated placeholders in a html website, i use the .next_sibling operator of BeautifulSoup. As long as the duplicates are in the same line, this works fine (see data). But sometimes there is a empty line between them - so i want .next_sibling to ignore them (have a look at data2)
That is the code:
from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)
string = 'method-removed-here'
for p in soup.find_all("p"):
while isinstance(p.next_sibling, Tag) and p.next_sibling.name== 'p' and p.text==string:
p.next_sibling.decompose()
print(soup)
Output for data is as expected:
<html><head></head><body><p>method-removed-here</p></body></html>
Output for data2 (this needs to be fixed):
<html><head></head><body><p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
</body></html>
I couldn't find useful information for that in the BeautifulSoup4 documentation and .next_element is also not what i am looking for.
In Beautiful Soup there is no in-built method to remove tags that has no content.
Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
I could solve this issue with a workaround. The problem is described in the google-group for BeautifulSoup and they suggest to use a preprocessor for html-files:
def bs_preprocess(html):
"""remove distracting whitespaces and newline characters"""
pat = re.compile('(^[\s]+)|([\s]+$)', re.MULTILINE)
html = re.sub(pat, '', html) # remove leading and trailing whitespaces
html = re.sub('\n', ' ', html) # convert newlines to spaces
# this preserves newline delimiters
html = re.sub('[\s]+<', '<', html) # remove whitespaces before opening tags
html = re.sub('>[\s]+', '>', html) # remove whitespaces after closing tags
return html
That's not the very best solution but one.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With