I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.
A sample HTML file looks something like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<!-- here is a comment inside the head tag -->
</head>
<body>
...
<!-- Comment inside body tag -->
<!-- Another comment inside body tag -->
<!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
</body>
</html>
<!-- This comment is the last line of the file -->
I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not anything inside the html tag or the last comment after the closing html tag.
Here is my code:
import re
import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")
for child in soup.children:
    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")
    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)
# The HTML comments should have been replaced but nothing changed.
print(soup.prettify(formatter=None))
This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?
Could I convert it to a byte stream via pickle, serialize it, apply a regex, and then deserialize it back to a BeautifulSoup object? Would this work or just cause more problems?
I tried using pickle on the child tag object, but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.
Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.
You have a couple of problems:
1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.
2. re.sub() doesn't modify anything in-place. Your line
       re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)
   would have had to be
       child.text = re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)
   ... but that wouldn't work anyway, because of point 1.
3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
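To make the first two problems concrete, here is a minimal sketch. The one-line markup and the variable names are made up for illustration, and "html.parser" is just an assumed parser choice; the point is only that the string you get from .text is detached from the tree, so nothing you do to it reaches the document.

```python
import re
import bs4

# Tiny made-up document: one paragraph with a comment in the middle.
soup = bs4.BeautifulSoup("<p>a<!-- note -->b</p>", "html.parser")
p = soup.p

# .text builds a fresh string on every access; it carries no "<!--"
# markers, so a regex looking for them has nothing to match.
text = p.text

# re.sub() returns a new string and leaves its input untouched.
changed = re.sub("a", "X", text)

# Assigning the result back is rejected because .text is read-only.
try:
    p.text = changed
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# The tree itself still holds the original comment node.
unchanged_markup = str(soup)
print(assignment_failed, unchanged_markup)
```

So even a "corrected" regex approach has nowhere to write its result, which is why the node-based replacement is the way to go.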
Here's a solution that works:
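import bs4

with open("example.html") as f:
    # Name the parser explicitly; otherwise bs4 picks one itself and warns.
    soup = bs4.BeautifulSoup(f, "html.parser")

# text= is the older spelling of this argument; recent bs4 releases prefer string=.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")
    tag.string = comment.strip()
    comment.replace_with(tag)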
This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just a shorthand for e.g.
def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)
Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.
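If you'd rather try this without a file on disk, the same steps can be wrapped in a small function and run on an inline string. This is only a sketch: comments_to_tags and the sample markup are hypothetical names of mine, "html.parser" is an assumed parser choice, and it uses string= (the newer name for the text argument) and soup.new_tag() as an alternative way to build the Tag.

```python
import bs4

def comments_to_tags(markup, tag_name="comment"):
    """Return markup with every HTML comment replaced by a <tag_name> element."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    # find_all() returns a list up front, so replacing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        tag = soup.new_tag(tag_name)      # new element owned by this soup
        tag.string = comment.strip()      # comment body without the padding
        comment.replace_with(tag)         # swap the node in the tree
    return str(soup)

# Made-up sample with comments before, inside, and after the html tag.
sample = "<!-- top --><html><body><p>hi</p><!-- inner --></body></html><!-- last -->"
converted = comments_to_tags(sample)
print(converted)
```

Because the search walks the whole tree, it picks up comments at any depth, including the ones outside the html tag that the loop over soup.children in the question could reach only at the top level.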
Here's the result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>