Regular Expression - Remove HTML comment spanning multiple line breaks

Question

I'm using this script:

http://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text

To convert some outlook HTML to plain text.

It nearly works, the only thing that it leaves behind is the CSS which outlook places in html comment tags  in addition to <style> tags (which are removed)

This is the original text:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
span.EmailStyle17
    {mso-style-type:personal-compose;
    font-family:"Calibri","sans-serif";
    color:windowtext;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
@page WordSection1
    {size:612.0pt 792.0pt;
    margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">tesst<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:dimgray;mso-fareast-language:EN-GB">JOE BLOGS</span></b><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:dimgray;mso-fareast-language:EN-GB">
</div>
</body>
</html>

This is the resulting text: (note the HTML comment has not been removed)

<!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:blue;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:purple;
    text-decoration:underline;}
span.EmailStyle17
    {mso-style-type:personal-compose;
    font-family:"Calibri","sans-serif";
    color:windowtext;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-family:"Calibri","sans-serif";
    mso-fareast-language:EN-US;}
@page WordSection1
    {size:612.0pt 792.0pt;
    margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
    {page:WordSection1;}
-->

tesst
&nbsp;
JOE BLOGS

I have tried adapting the StripHTML() function with the additional replaces - but these did not work either.

result = System.Text.RegularExpressions.Regex.Replace(result, "(<!--).*?(-->)", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<!--*-->", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)

Please help - this was a 2 minute job that i've been stuck on since lunch facedesk

Cheers

Edit 1: also tried the following - still no joy

result = System.Text.RegularExpressions.Regex.Replace(result, "<!--.*-->", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<!--.*?-->", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)

Edit 2: I noticed this question was getting a lot of views, anyone reading this should definitely think twice about taking the regExp approach, instead i recommend using Lynx (OpenSource text based browser) to convert HTML to plain text, i asked a similar question here and i provide sample code in the edits based on the answers that should get you started using lynx.exe from within a .net application. This is the method we ended up using and haven't had any problems since.

Mark Byers · Accepted Answer

Your second regular expression for three reasons:

You need to use . to match any character.
The * is greedy. You want *? to match lazily.
You need RegexOptions.Singleline.

Try this:

result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline);

I would strongly recommend that you do not use regular expressions to parse HTML. You will save yourself a whole world of pain if you instead use HTML Agility Pack.

Regular Expression - Remove HTML comment spanning multiple line breaks

Tags:

c#

regex

replace

vb.net

HeavenCore

1 Answers

Mark Byers

Recent Activity

Donate For Us

Regular Expression - Remove HTML comment spanning multiple line breaks

Tags:

c#

regex

replace

vb.net

HeavenCore

1 Answers

Mark Byers

Related questions

Recent Activity

Donate For Us