sed

Question

I am still a noob to shell scripts but am trying hard. Below is a partially working shell script which is supposed to remove all JavaScript from *.htm documents by matching script tags and deleting their enclosed content, e.g. <script src="">, <script></script> and <script type="text/javascript">.

find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done

The problem with this script is that, because sed reads its input line by line, it does not work as expected when a tag spans several lines. Running it on:

<script>
//Foo
</script>

will remove the first script tag but will leave the "//Foo" line and the closing tag behind, which I don't want.

Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?

devnull · Accepted Answer

Assuming that you have <script> tags on different lines, e.g. something like:

foo
bar
<script type="text/javascript">
some JS
</script>
foo

the following should work:

sed '/<script/,/<\/script>/d' inputfile
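One caveat with a range address: if a tag opens and closes on the same line (e.g. <script src=""></script>), the range still starts there and keeps deleting until a later line contains </script>. A minimal sketch of one way around that, assuming it is acceptable to drop the whole line a one-line tag sits on; the function name strip_script_blocks is only illustrative:

```shell
# Hypothetical helper: delete script blocks from the named files (or stdin).
strip_script_blocks() {
    # First expression: drop any line holding both the opening and closing tag.
    # Second expression: drop everything from an opening tag line through the
    # next line containing the closing tag.
    sed -e '/<script[^>]*>.*<\/script>/d' \
        -e '/<script/,/<\/script>/d' "$@"
}
```

Usage would be something like strip_script_blocks inputfile. Any other markup sharing a line with a script tag is deleted too, so this is only suitable for files where script tags sit on their own lines.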

suspectus · Answer

This awk script looks for the opening <script> tag and sets the skip variable. Lines are printed only while skip is zero, and the closing </script> tag resets it to zero. The variable cannot be named in, because in is a reserved word in awk. Placing the print rule between the two tag rules also handles a tag that opens and closes on the same line.

awk '/<script/    { skip = 1 }
     !skip        { print }
     /<\/script>/ { skip = 0 }' "$1"
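To apply a filter like this to every *.htm file under a directory, as the loop in the question intends, something along these lines may work. It is a sketch, not a tested production script: the function name strip_js_in_tree is made up for illustration, mktemp is assumed to be available, and the sed range deletion from the accepted answer is used as the filter. Passing file names through find -exec avoids the word splitting that $(cat ./patterns) suffers from when a path contains spaces.

```shell
# Hypothetical helper: rewrite every *.htm file under directory $1 in place.
strip_js_in_tree() {
    find "$1" -type f -name '*.htm' -exec sh -c '
        for f do
            tmp=$(mktemp) || exit 1
            # Filter into a temporary file first, then copy the result back,
            # so the original keeps its permissions and is only overwritten
            # when sed succeeded.
            if sed "/<script/,/<\/script>/d" "$f" > "$tmp"; then
                cat "$tmp" > "$f"
            fi
            rm -f "$tmp"
        done
    ' sh {} +
}
```

It would be called as strip_js_in_tree "$1" from the script. Since it overwrites files, trying it on a copy of the directory first seems wise.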

sed - Include newline in pattern

Tags: regex, shell, cygwin

GoofyBall
