Remove HTML comments from Markdown file

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.

Example input my.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

I can also write an AWK script that handles some common cases, but as I understand it, neither AWK nor any of the other common tools for simple text manipulation (like sed) is really up to this job. One would need to use an HTML parser.
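For reference, the "common cases" approach amounts to something like the following minimal Python sketch. It is a naive regex filter and not a proper solution: it assumes comments are never nested and it will also delete comment-like text inside code blocks or inline code spans. The function name simply mirrors the hypothetical remove_html_comments command used above.

```python
import re

# "<!--", then the shortest possible run of any characters
# (re.DOTALL lets "." match newlines too), then "-->".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def remove_html_comments(markdown: str) -> str:
    """Delete every <!-- ... --> span from the given text.

    Regex-only approximation: it has no idea whether a match sits
    inside a Markdown code block, so it removes those as well.
    """
    return COMMENT_RE.sub("", markdown)
```

Calling `sys.stdout.write(remove_html_comments(sys.stdin.read()))` would turn this into a stdin-to-stdout filter. Note that a multi-line comment leaves an empty line behind, and the space before an inline comment is kept.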

How would I write a proper remove_html_comments script, and with what tools?

hoijui asked Oct 26 '17 10:10



2 Answers

I see from your comment that you mostly use Pandoc.

Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments. The related issue provides some context to this change.

Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.
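As a rough sketch, this is how the option could be wired into a conversion step driven from Python. The helper names and the output file name my.html are assumptions made for illustration; the only parts taken from above are the --strip-comments option and the input file my.md. Pandoc 2.0 or later must be installed for the conversion itself to run.

```python
import subprocess
from typing import List


def build_pandoc_command(source: str, target: str) -> List[str]:
    """Build a pandoc call that converts Markdown to HTML without comments."""
    return [
        "pandoc",
        "--strip-comments",      # drop HTML comments instead of passing them through
        "--from", "markdown",
        "--to", "html",
        "--output", target,
        source,
    ]


def convert(source: str, target: str) -> None:
    """Run pandoc and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pandoc_command(source, target), check=True)
```

For example, `convert("my.md", "my.html")` would run the equivalent of `pandoc --strip-comments --from markdown --to html --output my.html my.md`.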

Chris answered Oct 27 '22 22:10


It might be a bit counter-intuitive, but I would use an HTML parser.

Example with Python and BeautifulSoup:

import sys
from bs4 import BeautifulSoup, Comment

md_input = sys.stdin.read()

soup = BeautifulSoup(md_input, "html5lib")

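# Find every comment node and remove it from the tree.
# (`string=` is the current name of the older `text=` argument.)
for element in soup.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()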

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

Output:

$ cat my.md | python md.py 
# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me 

It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).

Of course, test it thoroughly if you decide to use it.

Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)

helb answered Oct 27 '22 22:10