Remove HTML comments from Markdown file

Tags: html, bash, markdown, awk, pandoc

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.

Example input my.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

I could also write an AWK script that handles some common cases, but as I understand it, neither AWK nor any of the other common tools for simple text manipulation (like sed) is really up to this job. One would need to use an HTML parser.
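To illustrate the concern about simple pattern-based tools, here is a minimal sketch of a regex filter in Python. The function name strip_comments_naively is made up for this illustration. It handles the example above, but because it has no notion of HTML structure it also rewrites text that merely looks like a comment, for instance inside a script element.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->", spanning newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments_naively(text):
    """Delete everything that looks like an HTML comment (hypothetical helper)."""
    return COMMENT_RE.sub("", text)


# Works for the simple case from the question ...
print(strip_comments_naively("best,\nme <!-- more formal? -->\n"))

# ... but also alters a string literal inside a script element,
# which an HTML parser would not treat as a comment.
print(strip_comments_naively("<script>s = '<!-- x -->'</script>"))
```

This is the kind of "common case" handling that an AWK or sed script would give as well.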

How can I write a proper remove_html_comments script, and with what tools?

asked Oct 26 '17 10:10

hoijui

2 Answers

I see from your comment that you mostly use Pandoc.

Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments. The related issue provides some context to this change.

Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.
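As a rough sketch of how that could look when the conversion is driven from a script, the Python snippet below builds and runs the Pandoc command. The helper names and the output file my.html are assumptions made for this example; only --strip-comments and the standard --from, --to and --output options come from Pandoc itself.

```python
import subprocess


def build_pandoc_command(source, target):
    """Return the argument list for a Markdown-to-HTML run without comments."""
    return [
        "pandoc",
        "--strip-comments",      # drop HTML comments (Pandoc 2.0 and later)
        "--from", "markdown",
        "--to", "html",
        "--output", target,
        source,
    ]


def convert(source, target):
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(source, target), check=True)
```

Calling convert("my.md", "my.html") would then require Pandoc 2.0 or newer to be on the PATH.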

answered Oct 27 '22 22:10

Chris

It might be a bit counter-intuitive, but I would use an HTML parser.

Example with Python and BeautifulSoup:

import sys
from bs4 import BeautifulSoup, Comment

md_input = sys.stdin.read()

soup = BeautifulSoup(md_input, "html5lib")

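# Find every comment node in the parsed tree and detach it.
# (string= is the current name of the older text= argument.)
for element in soup(string=lambda text: isinstance(text, Comment)):
    element.extract()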

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

Output:

$ cat my.md | python md.py 
# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).

Of course, test it thoroughly if you decide to use it.

Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)
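If installing BeautifulSoup is not an option, the same parser-based idea can probably be sketched with Python's standard html.parser module. The class and function names below are made up for this illustration. The parser is used only to locate real comments, and the spans are then cut out of the original string, so everything else should stay exactly as written. Like any HTML parser, it knows nothing about Markdown, so a comment shown inside a Markdown code block would be removed too.

```python
from html.parser import HTMLParser


class CommentLocator(HTMLParser):
    """Record the (start, end) character span of each HTML comment."""

    def __init__(self, text):
        super().__init__()
        self.text = text
        # Index at which each line begins, to turn getpos() into an offset.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
        self.spans = []
        self.feed(text)
        self.close()

    def handle_comment(self, data):
        line, column = self.getpos()  # position where the comment starts
        start = self.line_starts[line - 1] + column
        if not self.text.startswith("<!--", start):
            return  # bogus comment such as "<!foo>": leave it alone
        # The closing ">" is the first one after the comment body.
        end = self.text.find(">", start + 4 + len(data))
        if end != -1:  # skip comments that are never closed
            self.spans.append((start, end + 1))


def remove_html_comments(text):
    """Return text with every located comment span cut out."""
    pieces, cursor = [], 0
    for start, end in CommentLocator(text).spans:
        pieces.append(text[cursor:start])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

To use it as the filter from the question, a script could end with sys.stdout.write(remove_html_comments(sys.stdin.read())) after importing sys.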

answered Oct 27 '22 22:10

helb
