
How to replace HTML comments with custom <comment> elements

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace each HTML comment with <comment>...</comment>. In this example HTML, I was able to replace the first two comments, but not the comments inside the html tag or the last comment after the closing html tag.

Here is my code:

file = open ("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I convert it to a byte stream by pickling it, apply a regex, and then deserialize it back into a BeautifulSoup object? Would this work or just cause more problems?

I tried using pickle on the child tag object but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.

asked Sep 28 '22 17:09 by user3621633


1 Answer

You have a couple of problems:

  1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

  2. re.sub() doesn't modify anything in-place. Your line

    re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
    

    would have had to be

    child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
    

    ... but that wouldn't work anyway, because of point 1.

  3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
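
The first two points can be checked with a small sketch. The one-line document and the variable names are made up for illustration, and "html.parser" is assumed only so that the snippet does not depend on lxml being installed:

```python
import re
import bs4

# Hypothetical one-line document, parsed with the standard library parser.
soup = bs4.BeautifulSoup("<p>before <!-- note --> after</p>", "html.parser")
p = soup.p
serialized_before = str(soup)

# .text builds a fresh string on every access. Comment nodes are left out
# of it, so a regex looking for "<!--" in it has nothing to match.
text = p.text

# re.sub() returns a new string; neither `text` nor the tree is changed.
changed = re.sub("before", "BEFORE", text)

# Assigning to .text is rejected because the property has no setter.
try:
    p.text = changed
    assignment_failed = False
except AttributeError:
    assignment_failed = True

serialized_after = str(soup)
```

After this runs, serialized_after is still identical to serialized_before: nothing done to the string returned by .text ever reaches the document.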

Here's a solution that works:

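import bs4

with open("example.html") as f:
    # No parser is named here, so bs4 picks the best one installed.
    soup = bs4.BeautifulSoup(f)

# Comment is a NavigableString subclass, so a string filter finds it.
# (Since bs4 4.4.0 this argument is also available as string=.)
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")      # new, empty <comment> element
    tag.string = comment.strip()       # comment text without padding
    comment.replace_with(tag)          # swap the node in the tree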

This code iterates over the result of a call to find_all(), taking advantage of the fact that a function can be passed as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string. The lambda is just shorthand for a named function such as:

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.
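
For converting many files, the same steps could be wrapped in a function. This is only a sketch under a few assumptions: comments_to_tags and its tag_name parameter are hypothetical names, the sample string is invented, "html.parser" is chosen so the example needs nothing beyond bs4 itself, and soup.new_tag() and the string= argument are used in place of bs4.Tag() and text=.

```python
import bs4

def comments_to_tags(markup, tag_name="comment"):
    """Return markup with every comment turned into a <tag_name> element."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    # find_all() returns a list, so replacing nodes while looping over it
    # does not disturb the iteration.
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        tag = soup.new_tag(tag_name)    # element created by the soup itself
        tag.string = comment.strip()    # plain text, padding removed
        comment.replace_with(tag)
    return str(soup)

# Invented sample with comments before, inside and after the html element.
sample = "<!-- top --><html><body><p>x</p><!-- inner --></body></html><!-- tail -->"
converted = comments_to_tags(sample)
print(converted)
```

A different parser may arrange the nodes around the html element differently, so the exact serialization can vary from one parser to another.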

Here's the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
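
The result contains real <comment> elements because each comment was replaced with a Tag rather than with a string of markup. A minimal sketch of the difference, using an invented two-comment document and assuming "html.parser" and the default output formatter:

```python
import bs4

soup = bs4.BeautifulSoup("<p><!-- a --><!-- b --></p>", "html.parser")
first, second = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))

# Replacing with a plain string inserts text, so the angle brackets are
# escaped when the tree is serialized with the default formatter.
first.replace_with("<comment>a</comment>")

# Replacing with a Tag inserts an actual element.
tag = soup.new_tag("comment")
tag.string = "b"
second.replace_with(tag)

result = str(soup)
print(result)
```

This is why the string-based replace_with() in the question only looked right when printed with formatter=None: the tree still held text, not elements.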
answered Oct 05 '22 06:10 by Zero Piraeus