I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string
style='text-align:center; color:blue;
exampleStyle:exampleValue'
The sed command that I am trying to modify is
sed "s/ style='[^']*'//" fileA > fileB
It works great, except that whenever there is a newline inside the text to be matched, it doesn't match. Is there a modifier for sed, or something I can do to force matching of any character, including newlines?
I understand that regexps are terrible at XML and HTML, blah blah blah, but in this case, the string patterns are well-formed in that the style attributes always start with a single quote and end with a single quote. So if I could just solve the newline problem, I could cut down the size of the HTML by over 50% with just that one command.
In the end, it turned out that Sinan Ünür's perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...
\s (lowercase s) matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f]. \S (uppercase S) matches any non-whitespace character. \t, \n, \r match tab, newline, and return. \d matches a decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s).
Special Characters
The special characters in sed are the same as those in grep, with one key difference: the forward slash / is a special character in sed, because it is the default delimiter in commands such as s/pattern/replacement/.
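To illustrate the point about the forward slash, here is a small sketch (the sample path and replacement are made up for the example). Because / is the default delimiter of the s command, a literal slash in the pattern has to be escaped, or a different delimiter can be chosen:

```shell
# Hypothetical sample input; any text containing slashes would do.
input='/usr/local/bin'

# With the default delimiter, each literal / must be escaped as \/.
escaped=$(printf '%s\n' "$input" | sed 's/\/usr\/local/\/opt/')

# The s command accepts another delimiter character, so | avoids the escaping.
alternate=$(printf '%s\n' "$input" | sed 's|/usr/local|/opt|')

printf '%s\n%s\n' "$escaped" "$alternate"
```

Both substitutions should produce the same result; the second form is usually easier to read when the pattern is a path or a URL.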
The sed command has a long list of supported operations that ease the process of editing text files. It lets users apply expressions of the kind commonly used in programming languages; one of the core ones is the regular expression (regex).
sed goes over the input file line by line, which means, as I understand it, that what you want is not possible in sed.
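A quick way to see the line-by-line behaviour is to run the command from the question on a small sample (the markup below is invented for the demonstration). The attribute that fits on one line is removed; the one that wraps onto a second line is left alone, because no single line contains both the opening and the closing quote:

```shell
# Invented two-paragraph sample: the first style attribute sits on one line,
# the second one is wrapped across two lines.
sample="<p style='color:red'>one</p>
<p style='text-align:center; color:blue;
exampleStyle:exampleValue'>two</p>"

# Same substitution as in the question, applied by sed one line at a time.
result=$(printf '%s\n' "$sample" | sed "s/ style='[^']*'//")

printf '%s\n' "$result"
```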
You could use the following Perl script (untested), though:
#!/usr/bin/perl
use strict;
use warnings;
{
local $/; # slurp mode
my $html = <>;
$html =~ s/ style='[^']*'//g;
print $html;
}
__END__
A one liner would be:
$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB