This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.
Example input my.md
:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
<!--
... due to a general shortage in the Y market
TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->
Example output my-filtered.md
:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
On Linux, I would do something like this:
cat my.md | remove_html_comments > my-filtered.md
I am also able to write an AWK script that handles some common cases,
but as I understood, neither AWK nor any other of the common tools for simple text manipulation (like sed
) are really up to this job. One would need to use an HTML parser.
How to write a proper remove_html_comments
script, and with what tools?
Markdown is a markup language which converts plain text into HTML code. It allows users to use special characters like asterisk, number sign, underscore and dashes in Markdown syntax instead of HTML.
There's nothing too much to explain this feature. It does what the title says, removes every HTML comment. Everything written between the <! -- beginning and --> closing tag is considered a comment.
[//]: <> (This is also a comment.) For maximum portability it is important to insert a blank line before and after this type of comments, because some Markdown parsers do not work correctly when definitions brush up against regular text.
Markdown doesn't include specific syntax for comments, but there is a workaround using the reference style links syntax. Using this syntax, the comments will not be output to the resulting HTML. Each of these lines works the same way: [...]: identifies a reference link (that won't be used in the article)
I see from your comment that you mostly use Pandoc.
Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments
. The related issue provides some context to this change.
Upgrading to the latest version and adding --strip-comments
to your command should remove HTML comments as part of the conversion process.
It might be a bit counter-intuitive, bud i would use a HTML parser.
Example with Python and BeautifulSoup:
import sys
from bs4 import BeautifulSoup, Comment
md_input = sys.stdin.read()
soup = BeautifulSoup(md_input, "html5lib")
for element in soup(text=lambda text: isinstance(text, Comment)):
element.extract()
# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:
output = "".join(map(str, soup.find("body").contents))
print(output)
Output:
$ cat my.md | python md.py
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not it's meaning):
Of course test it thouroughly if you decide to use it.
Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With