How can I Convert HTML to Text in C#?

Q: How do you convert HTML to plain text in C#?

Replace("<br ", "\n<br "); sbHTML.Replace("<p ", "\n<p "); // Finally, remove all HTML tags and return plain text.
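The fragment above is truncated, so the following is only a minimal sketch of the naive approach it hints at: put a newline in front of `<br>` and `<p>` tags, strip every tag, then decode entities. The class and method names (`NaiveHtmlStripper`, `StripTags`) are hypothetical, and the regular expressions are an assumption, not the original snippet's code. This is plain tag stripping and does not preserve layout beyond those line breaks.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class NaiveHtmlStripper
{
    // Hypothetical helper: naive tag stripping, not a real HTML parser.
    public static string StripTags(string html)
    {
        // Insert a newline before every <br ...> and <p ...> tag ($0 is the whole match).
        string withBreaks = Regex.Replace(html, @"<(br|p)\b", "\n$0", RegexOptions.IgnoreCase);

        // Remove anything that looks like a tag.
        string noTags = Regex.Replace(withBreaks, "<[^>]*>", string.Empty);

        // Turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(noTags);
    }
}
```

For example, `NaiveHtmlStripper.StripTags("<p>Hello<br>world &amp; co</p>")` yields a leading newline, then `Hello`, a newline, and `world & co`. Script and style contents, tables, and lists are not handled.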

Q: How do I convert a Web page to text?

Click the “Save as” or “Save Page As” option and select “Text Files” from the Save as Type drop-down menu. Type a name for the text file and click “Save.” The text from the Web page will be extracted and saved as a text file that can be viewed in text editors and document programs such as Microsoft Word.

Tags: html, c#, .net, text, parsing

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt at W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout by setting ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
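The approach described in EDIT 2 could be sketched roughly as follows. This is not the asker's actual class: the names `LynxDump`, `BuildStartInfo`, and `Run` are hypothetical, and the sketch assumes that lynx is installed and that the caller supplies its path. The argument quoting is deliberately simple and would need hardening for paths containing quotes.

```csharp
using System;
using System.Diagnostics;

public static class LynxDump
{
    // Builds the process settings described in EDIT 2.
    // lynxPath is an assumption, e.g. "lynx.exe" if it is on the PATH.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlPathOrUrl)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // -dump makes lynx write the rendered text to standard output.
            Arguments = "-dump \"" + htmlPathOrUrl + "\"",
            UseShellExecute = false,        // required in order to redirect streams
            RedirectStandardOutput = true,  // lets us read the rendered text
            CreateNoWindow = true
        };
    }

    // Runs lynx and returns whatever it wrote to standard output.
    public static string Run(string lynxPath, string htmlPathOrUrl)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlPathOrUrl)))
        {
            // Read the output before waiting, so a full pipe cannot block lynx.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call might look like `string text = LynxDump.Run("lynx.exe", @"C:\temp\page.html");`, where both the executable name and the file path are placeholders.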

263

asked Apr 08 '09 20:04

Matt Crouch

1 Answer

Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text, which, as noted by the OP, does not handle whitespace the way anyone writing HTML would expect. There are full text-rendering solutions out there, noted in other answers to this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.

using System; // needed for StringComparison
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }

    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }

    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html = html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                // intentional assignment: remember whether this text ended in whitespace
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            // only append the URL if the link text does not already contain it
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                            {
                                endElementString = "<" + href + ">";
                            }
                        }
                        isInline = true;
                        break;
                    case "li":
                        if (textInfo.ListIndex > 0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol":
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    // block elements get fresh whitespace state; inline elements share the parent's
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}

internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}

// reference wrapper so every PreceedingDomTextInfo shares one "first text written" flag
internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper { Value = boolWrapper };
    }
}
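To make the whitespace handling in the text-node branch easier to follow, here is a small standalone sketch of just that step: trim the end of the text, then collapse every run of two or more whitespace characters into one space. The class and method names (`WhitespaceDemo`, `CollapseRuns`) are hypothetical; only the regular expression is taken from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Same expression as in the text-node branch above:
    // trailing whitespace is trimmed, then runs of 2+ whitespace chars become one space.
    public static string CollapseRuns(string text)
    {
        return Regex.Replace(text.TrimEnd(), @"\s{2,}", " ");
    }
}
```

With this, `WhitespaceDemo.CollapseRuns("Thanks for   your\r\n        enquiry.  ")` returns `Thanks for your enquiry.`. Note that a single whitespace character, such as a lone tab or newline between two words, does not match `\s{2,}` and is left as it is.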

As an example, the following HTML code...

<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>

...will be transformed into:

Whatever Inc.


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

1.  Please confirm this is your email by replying.
2.  Then perform this step.

Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:

*   a point.
*   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.

Sincerely,

The whatever.com team


Ph: 000 000 000
mail: whatever st

...as opposed to:

        Whatever Inc.

            Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

                Please confirm this is your email by replying.

                Then perform this step.

            Please solve this . Then, in any order, could you please:

                a point.

                another point, with a hyperlink.

            Sincerely,

            The whatever.com team

        Ph: 000 000 000
        mail: whatever st

179

answered Oct 14 '22 03:10

Brent
