I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:
<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>
That's kind of hard to read, but basically the word "INTRODUCTION" is split into
<b><span>I</span></b>
and
<b><span>NTRODUCTION</span></b>
having the same inline properties for both span and b tags.
What's a good way to combine these? I figured I'd loop through to find consecutive b tags like this, but am stuck on how I'd go about merging the consecutive b tags.
for b in soup.findAll('b'):
try:
if b.next_sibling.name=='b':
## combine them here??
except:
pass
Any ideas?
EDIT: Expected output is the following
<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>
The solution below combines text from all the selected <b>
tags into one <b>
of your choice and decomposes the others.
If you only want to merge the text from consecutive tags follow Danny's approach.
Code:
from bs4 import BeautifulSoup
html = '''
<div id="wrapper">
<b style="mso-bidi-font-weight:normal">
<span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span>
</b>
<b style="mso-bidi-font-weight:normal">
<span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span>
</b>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
container = soup.select_one('#wrapper') # it contains b tags to combine
b_tags = container.find_all('b')
# combine all the text from b tags
text = ''.join(b.get_text(strip=True) for b in b_tags)
# here you choose a tag you want to preserve and update its text
b_main = b_tags[0] # you can target it however you want, I just take the first one from the list
b_main.span.string = text # replace the text
for tag in b_tags:
if tag is not b_main:
tag.decompose()
print(soup)
Any comments appreciated.
Perhaps you could check if the b.previousSibling
is a b
tag, then append the inner text from the current node into that. After doing this - you should be able to remove the current node from the tree with b.decompose
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With