wrapping subsections of text with tags in BeautifulSoup

Question

I want the BeautifulSoup equivalent of this jQuery question.

I'd like to find a particular regex match in BeautifulSoup text and then replace that segment of text with a wrapped version. I can do this with plaintext wrapping:

# wrap every word ending in "ug" in quotes,
# replacing the "ug" with "ook"

>>> import re
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("Snug as a bug in a rug")
>>> soup
<html><body><p>Snug as a bug in a rug</p></body></html>
>>> for text in soup.findAll(text=True):
...   if re.search(r'ug\b',text):
...     text.replaceWith(re.sub(r'(\w*)ug\b',r'"\1ook"',text))
...
u'Snug as a bug in a rug'
>>> soup
<html><body><p>"Snook" as a "book" in a "rook"</p></body></html>

But what if I want boldface rather than quotes? e.g. desired result =

<html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>

roippi · Accepted Answer

for text in soup.findAll(text=True):
    if re.search(r'ug\b', text):
        text.replaceWith(BeautifulSoup(re.sub(r'(\w*)ug\b', r'<b>\1ook</b>', text), 'html.parser'))

soup
Out[117]: <html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>

The idea here is that we're replacing a text node (a NavigableString, not a tag) with a fully formed parse tree. The easiest way to build that tree is to call BeautifulSoup on the regex-substituted string.

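The somewhat-magical 'html.parser' argument to the inner BeautifulSoup call stops it from wrapping the fragment in <html><body><p> tags, which bs4 otherwise does when it picks lxml as its parser (the wrapping is really lxml's behavior, not bs4's). The bs4 documentation's section on differences between parsers has more on that.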
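For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same idea. The function name bold_ug_words and the constant UG_WORD are made up for illustration, and the outer soup is built with 'html.parser' only so the example does not depend on lxml being installed. Instead of handing the whole BeautifulSoup object to replaceWith, it moves the fragment's child nodes into place one at a time, which should behave the same across bs4 versions.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical name: matches any word ending in "ug", capturing the prefix.
UG_WORD = re.compile(r'(\w*)ug\b')


def bold_ug_words(soup):
    """Wrap each word ending in "ug" in <b>, swapping "ug" for "ook"."""
    # find_all returns a list, so editing the tree inside the loop is safe.
    for text in soup.find_all(string=True):
        if not UG_WORD.search(text):
            continue
        markup = UG_WORD.sub(r'<b>\1ook</b>', text)
        # html.parser parses the fragment as-is, with no <html>/<body> wrapper.
        fragment = BeautifulSoup(markup, 'html.parser')
        # Move the parsed nodes in front of the old text node, then drop it.
        for node in list(fragment.contents):
            text.insert_before(node)
        text.extract()
    return soup


soup = bold_ug_words(BeautifulSoup('<p>Snug as a bug in a rug</p>', 'html.parser'))
print(soup)
```

One caveat with any re-parsing approach: the text is turned back into markup before being parsed, so characters such as < or & in the original text may be reinterpreted as HTML.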

mdadm · Answer

So here is one way to do it. You could use regex to create new HTML with the words surrounded by boldface, throw that into the BeautifulSoup constructor, and replace the entire parent p with the new p tag.

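import re

import bs4

# Pass "lxml" explicitly (requires lxml): it wraps the bare text in
# <html><body><p>, which gives the text node a <p> parent to replace.
soup = bs4.BeautifulSoup("Snug as a bug in a rug", "lxml")
print(soup)

for text in soup.findAll(text=True):
    if re.search(r'ug\b', text):
        # Build a new <p> with the matched words wrapped in <b>.
        new_html = "<p>" + re.sub(r'(\w*)ug\b', r'<b>\1ook</b>', text) + "</p>"
        new_soup = bs4.BeautifulSoup(new_html, "lxml")
        # Swap the whole parent <p> for the newly parsed one.
        text.parent.replace_with(new_soup.p)

print(soup)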

Another option would be to use the soup.new_tag method, but that might require a nested for loop, which won't be as elegant. I'll see if I can write it up and post it here later.
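In case it helps, here is a rough sketch of what the soup.new_tag route might look like. This is only one possible reading of that idea, not the answerer's code: the names bold_matches and PATTERN are hypothetical, and the outer soup uses 'html.parser' to avoid depending on lxml. It does need the nested loop mentioned above, one loop over text nodes and one over regex matches within each node.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical name: matches any word ending in "ug", capturing the prefix.
PATTERN = re.compile(r'(\w*)ug\b')


def bold_matches(soup):
    """Replace each "...ug" word with a <b> tag built via soup.new_tag."""
    for text in soup.find_all(string=True):
        pieces = []  # alternating plain strings and new <b> tags
        last = 0     # end offset of the previous match
        for match in PATTERN.finditer(text):
            if match.start() > last:
                pieces.append(text[last:match.start()])
            bold = soup.new_tag('b')
            bold.string = match.group(1) + 'ook'
            pieces.append(bold)
            last = match.end()
        if not pieces:
            continue  # no match in this text node
        if last < len(text):
            pieces.append(text[last:])
        # Insert the pieces in order before the old node, then remove it.
        for piece in pieces:
            text.insert_before(piece)
        text.extract()
    return soup


soup = bold_matches(BeautifulSoup('<p>Snug as a bug in a rug</p>', 'html.parser'))
print(soup)
```

The trade-off is more code, but nothing gets re-parsed, so the unmatched text is carried over as plain strings rather than being interpreted as markup a second time.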

Tags:

python

html

regex

beautifulsoup

Jason S
