
Beautiful Soup skip comment and script tags

I'm using Beautiful Soup for replacing text.

Here's an example of my code:

for x in soup.find('body').find_all(string=True):
   fix_str = re.sub(...)
   x.replace_with(fix_str)

How do I skip the script tags and the comments (<!-- -->)?

How can I determine which element or tag x belongs to?

User34 asked Oct 18 '22 03:10



1 Answer

Each text node has a parent attribute, so text.parent.name tells you which tag the text sits in, for example script. HTML comments are returned as Comment objects, so isinstance(text, Comment) identifies them. Any text node that passes both checks can be run through your re.sub() logic and swapped in with replace_with():

from bs4 import BeautifulSoup, Comment

html = """<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>

<body>
Some text 1
<!-- a comment -->
<!-- a comment -->
Some text 2
<!-- a comment -->
<script>a script</script>
Some text 2
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

for text in soup.body.find_all(string=True):
    if text.parent.name != 'script' and not isinstance(text, Comment):
        text.replace_with('new text')   # add re.sub() logic here

print(soup)

This gives the following HTML. The whitespace-only strings between the comments are text nodes too, so they are also replaced:

<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>
<body>new text<!-- a comment -->new text<!-- a comment -->new text<!-- a comment -->new text<script>a script</script>new text</body>
</html>
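
To show where the re.sub() call would slot in, here is a minimal sketch rather than a definitive implementation. The names replace_visible_text and SKIPPED_PARENTS are hypothetical, and skipping style tags and whitespace-only strings are my assumptions, not something the question asked for:

```python
import re
from bs4 import BeautifulSoup, Comment

# Assumption: text inside <style> should be left alone as well as <script>.
SKIPPED_PARENTS = {"script", "style"}


def replace_visible_text(soup, pattern, repl):
    """Apply re.sub() to text nodes in <body>, skipping comments and scripts."""
    for text in soup.body.find_all(string=True):
        if isinstance(text, Comment):
            continue  # HTML comment, leave untouched
        if text.parent.name in SKIPPED_PARENTS:
            continue  # text belongs to a <script> or <style> tag
        if not text.strip():
            continue  # assumption: ignore whitespace-only strings
        new_text = re.sub(pattern, repl, text)
        if new_text != text:
            text.replace_with(new_text)
    return soup


html = (
    "<html><body><p>Some text 1</p>"
    "<!-- Some text in a comment -->"
    "<script>Some text in a script</script>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
replace_visible_text(soup, r"Some text", "New text")
result = str(soup)
print(result)
```

Only the text inside the p tag should change here. The comment and the script content still read "Some text". To see which tag a given string belongs to while debugging, printing text.parent.name inside the loop is usually enough.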
Martin Evans answered Oct 21 '22 00:10
