
wrapping subsections of text with tags in BeautifulSoup

I want the BeautifulSoup equivalent of this jQuery question.

I'd like to find a particular regex match in BeautifulSoup text and then replace that segment of text with a wrapped version. I can do this with plaintext wrapping:

# replace all words ending in "ug" wrapped in quotes,
# with "ug" replaced with "ook"

>>> soup = BeautifulSoup("Snug as a bug in a rug")
>>> soup
<html><body><p>Snug as a bug in a rug</p></body></html>
>>> for text in soup.findAll(text=True):
...   if re.search(r'ug\b',text):
...     text.replaceWith(re.sub(r'(\w*)ug\b',r'"\1ook"',text))
...
u'Snug as a bug in a rug'
>>> soup
<html><body><p>"Snook" as a "book" in a "rook"</p></body></html>

But what if I want boldface rather than quotes? e.g. desired result =

<html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>
Jason S, asked Mar 27 '14 22:03


2 Answers

for text in soup.findAll(text=True):
    if re.search(r'ug\b', text):
        text.replaceWith(BeautifulSoup(re.sub(r'(\w*)ug\b', r'<b>\1ook</b>', text), 'html.parser'))

soup
Out[117]: <html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>

The idea here is that we're replacing a text node with a fully formed parse tree. The easiest way to build that tree is to call BeautifulSoup on the regex-substituted string.

The somewhat magical 'html.parser' argument to the inner BeautifulSoup call prevents it from adding <html><body><p> tags, as bs4 (well, lxml really) normally does. The Beautiful Soup documentation covers this under "Differences between parsers".
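To see the effect of the parser argument in isolation, here is a minimal sketch. It assumes only that bs4 is installed; the variable name `fragment` is made up for illustration. It parses an already-substituted string with 'html.parser' and checks that no wrapper tags were added. The lxml side of the comparison is left out so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Parse a fragment with the built-in parser; the markup is kept as a fragment.
fragment = BeautifulSoup("<b>Snook</b> as a <b>book</b>", "html.parser")

print(fragment)               # the markup comes back without <html>/<body>
print(fragment.find("html"))  # None: no wrapper element was created
```

Because the result is a bare fragment, it can stand in for the original text node without introducing a second document skeleton.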

roippi, answered Nov 07 '22 11:11


Here is one way to do it: use a regex to build new HTML with the matching words wrapped in <b> tags, pass that to the BeautifulSoup constructor, and replace the entire parent <p> with the new <p> tag. Note that this assumes the text sits directly inside a <p>.

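import bs4
import re

# "lxml" (assumed installed) wraps the bare string in <html><body><p>,
# giving the text a parent tag that can be replaced
soup = bs4.BeautifulSoup("Snug as a bug in a rug", "lxml")
print(soup)

for text in soup.findAll(text=True):
    if re.search(r'ug\b', text):
        # build a replacement <p> whose matching words are wrapped in <b>
        new_html = "<p>" + re.sub(r'(\w*)ug\b', r'<b>\1ook</b>', text) + "</p>"
        new_soup = bs4.BeautifulSoup(new_html, "html.parser")
        text.parent.replace_with(new_soup.p)

print(soup)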

Another option would be to use the soup.new_tag method, but that might require a nested for loop, which won't be as elegant. I'll see if I can write it up and post it here later.
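For reference, the soup.new_tag route could look roughly like the sketch below. This is not the answerer's code: the helper name `bold_ug_words` and the `PATTERN` constant are made up for illustration, and the input is wrapped in an explicit <p> so that 'html.parser' is enough. Each matching text node is split into plain strings and new <b> tags, which are inserted in order before the original node is removed.

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r'(\w*)ug\b')  # same regex as in the question

def bold_ug_words(soup):
    """Wrap each word ending in 'ug' in <b>, turning 'ug' into 'ook'."""
    # list(...) snapshots the text nodes so the tree can change while looping
    for text in list(soup.find_all(string=True)):
        pieces = []
        pos = 0
        for match in PATTERN.finditer(text):
            if match.start() > pos:
                # plain text between the previous match and this one
                pieces.append(text[pos:match.start()])
            bold = soup.new_tag("b")
            bold.string = match.group(1) + "ook"
            pieces.append(bold)
            pos = match.end()
        if not pieces:
            continue  # nothing matched in this text node
        if pos < len(text):
            pieces.append(text[pos:])  # trailing plain text
        # insert the pieces in order, then drop the original node
        for piece in pieces:
            text.insert_before(piece)
        text.extract()
    return soup

soup = BeautifulSoup("<p>Snug as a bug in a rug</p>", "html.parser")
print(bold_ug_words(soup))
```

Unlike replacing the whole parent <p>, this only touches the text node itself, so sibling tags and the parent's attributes are left alone.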

mdadm, answered Nov 07 '22 10:11