How do you convert Html to plain text?

The MIT-licensed HtmlAgilityPack includes, in one of its samples, a method that converts HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello, world!

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

// Requires: using System; using System.Text.RegularExpressions;
private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @">\s+<";//matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<br\s?/?>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR /> (used with IgnoreCase)
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.IgnoreCase);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    //Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);
    //Decode html specific characters last, so escaped markup such as &lt;b&gt; stays as literal text instead of being stripped as a tag
    text = System.Net.WebUtility.HtmlDecode(text);

    return text;
}

If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>
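
As a minimal sketch of applying that pattern in C# (the class and method names below are hypothetical, chosen only for illustration), a single Regex.Replace call removes every tag:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper // hypothetical helper name, not from the original answer
{
    // Removes anything that looks like a tag:
    // '<', then any run of characters other than '>', then '>'.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }
}
```

Calling TagStripper.StripTags("<b>hello, <i>world!</i></b>") returns hello, world!. Text inside <script> or <style> elements is left in the output, which is the limitation discussed next.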

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions, because you need to track state: something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.
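
The non-greedy approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name is hypothetical, and the pattern assumes each <script> or <style> element ends at the first matching closing tag, so it can still be fooled by a literal closing tag inside a script string or by markup inside comments.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptAwareStripper // hypothetical helper name
{
    // Non-greedy match of a whole <script> or <style> element, body included.
    // Singleline lets '.' span line breaks; \1 ties the closing tag to the opening one.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Any remaining tag.
    private static readonly Regex AnyTag = new Regex("<[^>]*>");

    public static string StripTags(string html)
    {
        // Drop script/style elements first so their contents never reach the output.
        string withoutScripts = ScriptOrStyle.Replace(html, string.Empty);
        return AnyTag.Replace(withoutScripts, string.Empty);
    }
}
```

Removing the script and style elements before the generic tag pass matters: a script body such as if (a < b) would otherwise be partly eaten by the simple tag pattern and partly left in the text.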

If you can use regular expressions there are many web pages out there with good info:

http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
http://www.google.com/search?hl=en&q=html+tag+stripping+&btnG=Search

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool. Unfortunately, I don't know of a good one to recommend.

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.

The HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
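
Outside of an ASP.NET page, where the Server object is not available, the same encoding can be sketched with System.Net.WebUtility, the class already used for decoding in the regex answer above. The wrapper class and method names here are hypothetical and exist only to make the example self-contained:

```csharp
using System;
using System.Net;

public static class EncodingDemo // hypothetical wrapper name
{
    // Turns markup characters into entities so tags show up as literal text.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Reverses the encoding, turning entities back into characters.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

EncodingDemo.Encode("This is a <Test String>.") returns This is a &lt;Test String&gt;. and passing that result to EncodingDemo.Decode gives back the original string. Note that encoding makes tags visible as text; it does not remove them.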

Three Step Process for converting HTML into Plain Text

First, install the HtmlAgilityPack NuGet package. Second, create this class:

using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

The class above is based on Judah Himango's answer.

Third, create an instance of the class and call its ConvertHtml(HTMLContent) method, rather than ConvertToPlainText(html), to convert HTML into plain text:

HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

Tags: html, c#, asp.net
