
delete html comment tags using regexp

This is what my text (HTML) file looks like:
     |                                |
     |  This is a dummy comment       |
     |      please delete me          |
     |         asap                   |
     |                                |
     | -->

    this is another line 
    in this long dummy html file...
    please do not delete me

I'm trying to delete the comment using sed:

cat file.html | sed 's/.*<!--\(.*\)-->.*//g'

It doesn't work :( What am I doing wrong?

Thank you very much for your help!

Zenet asked Oct 29 '10 20:10


3 Answers

patrickmdnet has the correct answer. Here it is on one line using extended regex:

cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

Here is a good resource for learning more about sed. This command is an adaptation of one-liner #92.
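For readers who find the one-liner dense, the same loop can be written out with one command per line. This is a sketch, not the answerer's exact script: `strip_comments` is a hypothetical helper name, the expression is a basic regular expression (so `-r` is not needed), and `$!` is added so that `N` is never run on the last line. Like the one-liner, it assumes a sed in which `.` matches a newline embedded in the pattern space, as GNU sed does. Note also that sed's regular expressions have no non-greedy quantifier, so the `?` after `.*` in the one-liner does not make the match lazy. If two comments share a line, both versions remove them together with the text between them.

```shell
# strip_comments is a hypothetical helper name used only for this sketch.
strip_comments() {
  sed '
    # Loop start.
    :a
    # Remove every complete comment currently in the pattern space.
    s/<!--.*-->//g
    # A comment is still open: unless this is the last line, append the
    # next input line and try again.
    /<!--/{
      $!{
        N
        ba
      }
    }
  ' "$@"
}

# Sample input made up for illustration; with GNU sed this prints
# "a  b" followed by "c".
printf 'a <!-- x\ny --> b\nc\n' | strip_comments
```

The function passes its arguments on to sed, so `strip_comments file.html` works as well as piping input into it.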


Brian Clements answered Nov 05 '22 02:11


One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
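Both problems are easy to reproduce. The sketch below uses made-up sample input, not the asker's file, and `original_attempt` is a hypothetical wrapper around the expression from the question.

```shell
# original_attempt is a hypothetical wrapper around the expression from the question.
original_attempt() {
  sed 's/.*<!--\(.*\)-->.*//g'
}

# The leading and trailing .* match the rest of the line, so the whole line
# is replaced by an empty string, not just the comment.
# Prints an empty line: "keep" and "me" are lost.
printf 'keep <!-- note --> me\n' | original_attempt

# sed processes one line at a time, so a comment that opens on one line and
# closes on the next is never matched. Prints both lines unchanged.
printf '<!-- first\nsecond -->\n' | original_attempt
```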

You would be better off using existing code instead of rolling your own.


#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall


(from http://sed.sourceforge.net/grabbag/scripts/)

See this link for various ways to use Perl modules to remove HTML comments (using Regexp::Common, HTML::Parser, or File::Comments). I am sure there are methods using other utilities as well.


patrickmdnet answered Nov 05 '22 02:11


I think you can do this with awk if you want. Start:

[~] $ more test.txt

<!-- An HTML style comment -->


Some other text


<!-- Whoops
     Another comment -->

Result of the awk:

[~]$ awk '/<!--/ {off=1} /-->/ {off=2} {if (off==0) print; if (off==2) off=0}' test.txt
Some other text


eldarerathis answered Nov 05 '22 00:11
