sed - Include newline in pattern

I am still a noob to shell scripts but am trying hard. Below is a partially working shell script that is supposed to remove all JS from *.htm documents by matching script tags and deleting them along with their enclosed content, e.g. <script src="">, <script></script> and <script type="text/javascript">.

find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done

The problem with this script is that sed reads its input line by line, so it does not work as expected when a script element spans several lines. Running it on:

<script>
//Foo
</script>

will remove the opening script tag but leave the "//Foo" line and the closing tag in place, which I don't want.

Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?

GoofyBall asked Jul 16 '13 08:07


2 Answers

Assuming that the opening and closing <script> tags are on different lines, e.g. something like:

foo
bar
<script type="text/javascript">
some JS
</script>
foo

the following should work:

sed '/<script/,/<\/script>/d' inputfile
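To apply the range delete above to every *.htm file under a directory, as the script in the question tries to do, one option is to let find hand the files to a small loop. This is only a sketch: strip_scripts is a hypothetical helper name, and the temporary file "$f.tmp" is an assumption made to avoid relying on sed's non-portable -i option.

```shell
#!/bin/sh
# strip_scripts DIR (hypothetical helper, not part of the answer above):
# for each *.htm file under DIR, delete every line from one containing
# "<script" through the next one containing "</script>".
strip_scripts() {
    find "$1" -type f -name '*.htm' -exec sh -c '
        for f do
            # Write to a temporary file first, then replace the original.
            sed "/<script/,/<\/script>/d" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        done
    ' sh {} +
}

# Example call (the path is an assumption): strip_scripts ./site
```

Note that sed only starts looking for the end of the range on the line after the one that opened it, so an element such as <script src=""></script> that opens and closes on a single line will also swallow the lines that follow, up to the next closing tag or the end of the file.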
devnull answered Sep 24 '22 14:09


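This awk script looks for the opening <script ...> tag, sets the inscript flag and skips to the next line. When the closing </script> tag is found, the flag is reset to zero. The final rule prints a line only while the flag is zero. The flag is named inscript because in is a reserved word in awk and cannot be used as a variable name.

awk '/<script.*>/   { inscript = 1; next }   # opening tag: start skipping
     /<\/script.*>/ { inscript = 0; next }   # closing tag: stop skipping
     !inscript      { print }                # print lines outside a script
    ' "$1"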
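A line-oriented filter that switches on at the opening tag and off at the closing tag can miss the case from the question where both tags share a line, e.g. <script src=""></script>. The following is a minimal sketch of a variant that scans within each line instead. The function name strip_script_elements is hypothetical, and it assumes lowercase tags and a closing tag written exactly as </script>.

```shell
#!/bin/sh
# strip_script_elements [FILE...] (hypothetical helper):
# removes <script ...>...</script> elements whether they span several
# lines or sit on one line next to other markup. Reads stdin if no FILE.
strip_script_elements() {
    awk '
    {
        rest = $0           # unprocessed part of the current line
        out = ""            # text to keep from this line
        touched = inscript  # true if this line starts inside a script
        while (rest != "") {
            if (inscript) {
                i = index(rest, "</script>")
                if (i == 0) break                 # script continues on the next line
                rest = substr(rest, i + 9)        # 9 = length of the closing tag
                inscript = 0
            } else {
                i = index(rest, "<script")
                if (i == 0) { out = out rest; break }
                out = out substr(rest, 1, i - 1)  # keep text before the tag
                rest = substr(rest, i + 7)        # 7 = length of "<script"
                inscript = 1
                touched = 1
            }
        }
        # Print untouched lines verbatim; drop lines left empty by removal.
        if (!touched || out != "") print out
    }' "$@"
}
```

It writes the result to standard output, so redirecting to a temporary file and moving it over the original is still needed to edit a document in place.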
suspectus answered Sep 24 '22 14:09