Find by Text and Replace in HTML BeautifulSoup

I'm trying to mark up an HTML file (literally wrapping strings in "mark" tags) using Python and BeautifulSoup. The problem is basically as follows.

Say I have my original html document:

test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"

I want to do a case-insensitive search for a string in this document, ignoring the HTML markup, and wrap the matching HTML in "mark" tags. So let's say I want to find "here is some silly text", which should match even though "SILLY" sits inside bold tags.

For example, if I want to search for "here is some silly text" in test, the desired output is:

"<h1>oh hey</h1><div><mark>here is some <b>SILLY</b> text</mark></div>"

Any ideas? If it's more appropriate to use lxml or regular expressions, I'm open to those solutions as well.

follyroof asked May 28 '13 19:05


1 Answer

>>> import bs4
>>> # The output below (with <html><body> added) assumes the lxml parser.
>>> soup = bs4.BeautifulSoup(test, 'lxml')
>>> matches = soup.find_all(lambda x: x.text.lower() == 'here is some silly text')
>>> for match in matches:
...     match.wrap(soup.new_tag('mark'))
>>> soup
<html><body><h1>oh hey</h1><mark><div>here is some <b>SILLY</b> text</div></mark></body></html>

I pass a function as the name argument to find_all and compare x.text.lower(), rather than using the text argument with a function that compares x.lower(), because the text argument is matched against individual strings. It will not find content that is split across child tags, such as the <b> here, which is apparently what you want.
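To make the difference concrete, here is a minimal sketch comparing the two kinds of filter on the question's markup. It assumes a recent bs4 where the string filter is spelled string= (older releases call it text=) and uses the built-in html.parser; the variable names are only illustrative.

```python
import bs4

test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"
soup = bs4.BeautifulSoup(test, "html.parser")
target = "here is some silly text"

# A string filter only sees individual text nodes, so text that is
# split across child tags (here by <b>) never matches as a whole.
by_string = soup.find_all(string=lambda s: s is not None and s.lower() == target)

# A tag filter can compare each tag's full concatenated text instead.
by_tag = soup.find_all(lambda tag: tag.get_text().lower() == target)
by_tag_names = [tag.name for tag in by_tag]

print(by_string)     # no text node equals the whole phrase
print(by_tag_names)  # the <div> does
```

The string filter comes back empty because the phrase is spread over three text nodes, while the tag filter finds the div.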

The wrap function may not work this way in some cases. If it doesn't, you will have to instead enumerate(matches), and set matches[i] = match.wrap(soup.new_tag('mark')). (You can't use replace_with to replace a tag with a new tag that references itself.)
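Note that wrap puts the mark element around the matched tag, whereas the question's desired output has it inside the div. If that placement matters, one possible approach (a sketch, not the only way) is to move the matched tag's children into a new mark tag and then append that tag to the now-empty match. This assumes the built-in html.parser and that matches are not nested inside each other.

```python
import bs4

test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"
soup = bs4.BeautifulSoup(test, "html.parser")
target = "here is some silly text"

for match in soup.find_all(lambda tag: tag.get_text().lower() == target):
    mark = soup.new_tag("mark")
    # Move every child of the matched tag into the new <mark> element.
    # Iterate over a copy because extract() mutates match.contents.
    for child in list(match.contents):
        mark.append(child.extract())
    # The matched tag is now empty, so <mark> becomes its only child.
    match.append(mark)

result = str(soup)
print(result)
```

With html.parser no html or body wrapper is added, so the result is just the original fragment with the mark element inside the div.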

Also note that if your intended use case allows any non-ASCII string to ever match 'here is some silly text' (or if you want to broaden the code to handle non-ASCII search strings), the code above using lower() may be incorrect. You may want to call str.casefold() and/or locale.strxfrm(s) and/or use locale.strcoll(s, t) instead of using ==, but you'll have to understand what you want and how to get it to pick the right answer.
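As a small illustration of where lower() and casefold() diverge, the sketch below compares two spellings of a German word. The helper name same_text is hypothetical, and this covers only case folding, not locale-aware collation or Unicode normalization.

```python
def same_text(a: str, b: str) -> bool:
    """Compare two strings caselessly using str.casefold()."""
    return a.casefold() == b.casefold()


# lower() leaves "ß" unchanged, so these two spellings differ.
lower_equal = "STRASSE".lower() == "straße".lower()

# casefold() maps "ß" to "ss", so they compare equal.
casefold_equal = same_text("STRASSE", "straße")

print(lower_equal, casefold_equal)
```

Swapping a comparison like this into the find_all lambda in place of the lower() check would be one way to broaden the match, if that is the behavior you want.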

abarnert answered Sep 26 '22 22:09