I need to remove all tags from an HTML file with a bash script using the sed command. I tried this:
sed -r 's/[\<][\/]?[a-zA-Z0-9\=\"\-\#\.\& ]+[\/]?[\>]//g' $1   and this:
sed -r 's/[\<][\/]?[.]*[\/]?[\\]?[\>]//g' $1   but I am still missing something. Any suggestions?
You can either use one of the many HTML-to-text converters, use a Perl regex such as <.+?> if possible, or, if it must be sed, use <[^>]*>:
sed -e 's/<[^>]*>//g' file.html   If there's no room for errors, use an HTML parser instead. For example, when a tag is spread over two lines, as in
<div
>Lorem ipsum</div>   this regular expression will not work, because sed processes its input one line at a time.
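If the split-tag case matters and it still has to be sed, one possible workaround is to join all lines into the pattern space before substituting. The following is only a sketch: the function name strip_tags is made up for illustration, and it assumes the whole file fits comfortably in memory. It is still not an HTML parser, so a literal > inside an attribute value, a comment, or a script block will confuse it.

```shell
#!/bin/sh
# strip_tags: hypothetical helper name, not part of the original answer.
# Reads the files given as arguments (or standard input when there are none),
# appends every following line to sed's pattern space, and only then removes
# each <...> span, so a tag broken across lines is matched as a whole.
strip_tags() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g' "$@"
}

# Example: the opening tag is split across two lines.
result=$(printf '<div\n>Lorem ipsum</div>\n' | strip_tags)
printf '%s\n' "$result"
```

Inside a script it could be called as strip_tags "$1". The loop is written as $!{N;ba} rather than a bare N so that a one-line input still reaches the substitution.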
This regular expression consists of three parts: <, [^>]* and >. First it searches for the opening <. This is followed by zero or more characters (the *) which are not the closing >; [...] is a character class, and when it starts with ^ it matches characters not in the class. Finally it looks for the closing >.
The simpler regular expression <.*> will not work, because it searches for the longest possible match, i.e. up to the last closing > in an input line. E.g., when you have more than one tag in an input line
<name>Olaf</name> answers questions.   will result in
answers questions.
instead of
Olaf answers questions.
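The difference can be checked directly from a shell. This is a small sketch using the example line above; the variable names line, greedy and negated are only illustrative.

```shell
#!/bin/sh
# Compare the greedy pattern with the negated character class on one line.
line='<name>Olaf</name> answers questions.'

# <.*> is greedy: it runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# <[^>]*> stops at the first > after each <, so each tag is removed separately.
negated=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

# Brackets make the leading space in the greedy result visible.
printf 'greedy:  [%s]\n' "$greedy"
printf 'negated: [%s]\n' "$negated"
```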
See also Repetition with Star and Plus, especially section Watch Out for The Greediness! and following, for a detailed explanation.