
BeautifulSoup - combine consecutive tags

I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:

<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>

That's kind of hard to read, but basically the word "INTRODUCTION" is split into

<b><span>I</span></b> 

and

<b><span>NTRODUCTION</span></b>

having the same inline properties for both span and b tags.

What's a good way to combine these? I figured I'd loop through to find consecutive b tags like this, but am stuck on how I'd go about merging the consecutive b tags.

for b in soup.find_all('b'):
    nxt = b.next_sibling
    # next_sibling may be None or a text node, so read .name defensively
    if getattr(nxt, 'name', None) == 'b':
        pass  # combine them here??

Any ideas?

EDIT: Expected output is the following

<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>

asked Mar 06 '23 08:03 by jma


2 Answers

The solution below combines text from all the selected <b> tags into one <b> of your choice and decomposes the others.

If you only want to merge the text from consecutive tags, follow Danny's approach.

Code:

from bs4 import BeautifulSoup

html = '''
<div id="wrapper">
  <b style="mso-bidi-font-weight:normal">
    <span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span>
  </b>
  <b style="mso-bidi-font-weight:normal">
    <span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span>
  </b>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
container = soup.select_one('#wrapper')  # it contains b tags to combine
b_tags = container.find_all('b')

# combine all the text from b tags
text = ''.join(b.get_text(strip=True) for b in b_tags)

# here you choose a tag you want to preserve and update its text
b_main = b_tags[0]  # you can target it however you want, I just take the first one from the list
b_main.span.string = text  # replace the text

for tag in b_tags:
    if tag is not b_main:
        tag.decompose()

print(soup)

Any comments appreciated.

answered Mar 20 '23 15:03 by radzak


Perhaps you could check whether b.previous_sibling is a b tag, then append the inner text of the current node to it. After that, you should be able to remove the current node from the tree with b.decompose().
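
A minimal sketch of that idea, assuming the b tags are direct siblings with nothing between them (as in the question's markup) and that each one wraps a single span. The function name merge_consecutive_b and the shortened sample HTML are made up for illustration. Note that this keeps the attributes of the first tag in each run; to match the expected output in the question exactly, you would keep the later tag instead.

```python
from bs4 import BeautifulSoup, Tag

# Simplified sample: two adjacent <b> tags, then a third one separated by text
html = (
    '<p>'
    '<b style="mso-bidi-font-weight:normal"><span style="font-size:14.0pt">I</span></b>'
    '<b style="mso-bidi-font-weight:normal"><span>NTRODUCTION</span></b>'
    ' and <b><span>more</span></b>'
    '</p>'
)

soup = BeautifulSoup(html, 'html.parser')


def merge_consecutive_b(soup):
    """Fold each <b> into the <b> immediately before it, if there is one."""
    # find_all returns a list, so removing tags while looping is safe
    for b in soup.find_all('b'):
        prev = b.previous_sibling
        # Skip text nodes, None, and tags other than <b>
        if isinstance(prev, Tag) and prev.name == 'b':
            # Write into the inner <span> when present, else into the <b> itself
            target = prev.span if prev.span is not None else prev
            target.string = prev.get_text() + b.get_text()
            b.decompose()  # remove the now-redundant tag from the tree


merge_consecutive_b(soup)
print(soup)
```

Because decompose() relinks the siblings, a run of three or more adjacent b tags collapses into the first one. The third b in the sample is left alone, since its previous sibling is the text " and ".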

answered Mar 20 '23 17:03 by Danny Staple