I'm using Beautiful Soup for replacing text.
Here's an example of my code:
for x in soup.find('body').find_all(string=True):
    fix_str = re.sub(...)
    x.replace_with(fix_str)
How do I skip the contents of <script> tags and HTML comments (<!-- -->)?
How can I determine which element or tag each x belongs to?
Every text node has a parent, so checking text.parent.name tells you whether the text sits inside a <script> tag. Comments are returned as a special string type, so isinstance(text, Comment) identifies them. Any text node that passes both checks can then be replaced with replace_with(), using the result of your re.sub() call:
from bs4 import BeautifulSoup, Comment
html = """<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>
<body>
Some text 1
<!-- a comment -->
<!-- a comment -->
Some text 2
<!-- a comment -->
<script>a script</script>
Some text 2
</body>
</html>"""
soup = BeautifulSoup(html, "html.parser")
for text in soup.body.find_all(string=True):
    # Skip text inside <script> tags and skip HTML comments
    if text.parent.name != 'script' and not isinstance(text, Comment):
        text.replace_with('new text')  # add re.sub() logic here
print(soup)
This gives you the following new HTML (whitespace-only text nodes between the tags are replaced as well):
<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>
<body>new text<!-- a comment -->new text<!-- a comment -->new text<!-- a comment -->new text<script>a script</script>new text</body>
</html>
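To plug in real re.sub() logic, the same checks can be wrapped in a small helper. This is only a sketch under a few assumptions: the name replace_visible_text, the SKIP_PARENTS set (which also skips <style>) and the sample pattern are made up for illustration, and whitespace-only nodes are left alone so the original layout is preserved.

```python
import re
from bs4 import BeautifulSoup, Comment

# Assumption: <style> content should be left alone as well as <script>.
SKIP_PARENTS = {"script", "style"}

def replace_visible_text(soup, pattern, repl):
    """Apply re.sub() to text nodes in <body>, skipping scripts, styles and comments."""
    # find_all() returns a list, so replacing nodes inside the loop is safe.
    for text in soup.body.find_all(string=True):
        if isinstance(text, Comment):
            continue  # HTML comment
        if text.parent.name in SKIP_PARENTS:
            continue  # text belongs to a <script> or <style> tag
        if not text.strip():
            continue  # whitespace-only node, keep as it is
        new_text = re.sub(pattern, repl, str(text))
        if new_text != str(text):
            text.replace_with(new_text)
    return soup

# Hypothetical sample input, not taken from the question.
html = ("<html><body>Some text 1<!-- Some text -->"
        "<script>var s = 'Some text';</script>"
        "<p>Some text 2</p></body></html>")
soup = BeautifulSoup(html, "html.parser")
replace_visible_text(soup, r"Some", "Any")
print(soup)
```

Inside the loop, text.parent.name gives the name of the enclosing tag and type(text).__name__ gives the string type (for example Comment), which is one way to see what each item returned by find_all(string=True) actually is.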