I need to replace multiple words in an HTML document. At the moment I am doing this by calling replace_with once for each replacement. Calling replace_with twice on the same NavigableString raises a ValueError (see the example below) because the replaced element is no longer in the tree.
Minimal example
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re

def test1():
    html = \
'''
Identify
'''
    soup = BeautifulSoup(html,features="html.parser")
    for txt in soup.findAll(text=True):
        if re.search('identify',txt,re.I) and txt.parent.name != 'a':
            newtext = re.sub('identify', '<a href="test.html"> test </a>', txt.lower())
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
            # I called it twice here to make the code as small as possible.
            # Usually it would be a different newtext ..
            # which was created using the replaced txt looking for a different word to replace.
    return soup

print(test1())
Expected Result:
txt is replaced by the new string (newtext)
Result:
ValueError: Cannot replace one element with another when the element to be replaced is not part of the tree.
An easy solution would be to build up the new string first and call replace_with only once at the end, but I would like to understand why this happens.
The first txt.replace_with(...) removes the NavigableString (here stored in the variable txt) from the document tree. This effectively sets txt.parent to None.
The second txt.replace_with(...) looks at the parent property, finds None (because txt has already been removed from the tree) and raises a ValueError.
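The behaviour described above can be checked directly with a small sketch. The `<p>` wrapper and the replacement strings are assumptions made for illustration only; they are not part of the original example.

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration: a single string inside a <p> tag.
soup = BeautifulSoup("<p>Identify</p>", features="html.parser")
txt = soup.p.string               # the NavigableString inside <p>
print(txt.parent.name)            # p

txt.replace_with("replaced")      # first call: txt is detached from the tree
print(txt.parent)                 # None

try:
    txt.replace_with("again")     # second call: txt no longer has a parent
except ValueError as exc:
    print("ValueError:", exc)

print(soup)                       # <p>replaced</p>
```

The variable txt still refers to the old, detached string after the first call, so any further change has to target a node that is still in the tree.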
As you said at the end of your question, one solution is to call .replace_with() only once:
import re
from bs4 import BeautifulSoup

def test1():
    html = \
'''
word1 word2 word3 word4
'''
    soup = BeautifulSoup(html,features="html.parser")
    for txt in soup.findAll(text=True):
        if re.search('word1', txt, flags=re.I) and txt.parent.name != 'a':
            newtext = re.sub('word1', '<a href="test.html"> test1 </a>', txt.lower())
            # ...some computations
            newtext = re.sub('word3', '<a href="test.html"> test2 </a>', newtext)
            # ...some more computations
            # and at the end, replace txt only once:
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
    return soup

print(test1())
Prints:
<a href="test.html"> test1 </a> word2 <a href="test.html"> test2 </a> word4
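If there are many words to replace, the same idea can be written as one regex pass per text node, which also avoids a later substitution matching inside a link inserted by an earlier one. This is only a sketch: the helper name link_words, the links mapping and the sample markup are hypothetical, and the text is not HTML-escaped before it is re-parsed.

```python
import re
from bs4 import BeautifulSoup

def link_words(html, links):
    """Wrap each word in `links` (hypothetical word -> href mapping) in an <a> tag."""
    soup = BeautifulSoup(html, features="html.parser")
    hrefs = {word.lower(): href for word, href in links.items()}
    # One alternation pattern so that every word is handled in a single pass.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in hrefs) + r")\b", flags=re.I
    )

    def to_link(match):
        word = match.group(0)
        return f'<a href="{hrefs[word.lower()]}">{word}</a>'

    # find_all returns a list, so replacing nodes inside the loop is safe.
    for txt in soup.find_all(string=True):
        if txt.parent.name == "a":
            continue                      # leave existing links alone
        newtext = pattern.sub(to_link, str(txt))
        if newtext != txt:
            # Replace each string exactly once, after all substitutions.
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
    return soup

result = link_words(
    "<p>word1 word2 word3 word4</p>",
    {"word1": "a.html", "word3": "b.html"},
)
print(result)
```

Each NavigableString is still replaced only once, so the ValueError from the question cannot occur.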