How can I strip HTML tags from a string in ASP.NET?

If you just need to strip all HTML tags from a string, a regex works as well, though not in every case (see the note below). Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
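The steps above could be put together roughly as follows. This is only a sketch: the class and method names (`HtmlStripper`, `StripTags`) are made up for illustration, and using `System.Net.WebUtility.HtmlDecode` for the optional entity step is my assumption, not something the answer prescribes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Hypothetical helper; the name is illustrative only.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Remove anything that looks like a tag, including an
        //    unterminated one at the very end of the input.
        string text = Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);

        // 2. Optional: turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);

        // 3. Collapse runs of whitespace into one space and trim.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, `HtmlStripper.StripTags("<p>Hello,   <b>world</b> &amp; all</p>")` returns `Hello, world & all`. If the result is written back into a page, encode it again, because step 2 can reintroduce `<` and `>` characters.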

Note:

There is a limitation: HTML and XML allow > in attribute values. When it encounters such a value, this solution ends the tag match early and leaves the rest of the attribute in the output.
The solution is technically safe, as in: the result will never contain anything that could be used for cross-site scripting or to break a page layout, provided you do not decode entities afterwards. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
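To see the limitation from the note in practice, here is a small sketch. The class name `AttributeLimitationDemo` and the sample input are invented for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeLimitationDemo
{
    // Applies only the tag-removal pattern, without any normalization.
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);
    }
}
```

Calling `AttributeLimitationDemo.Strip("<a title=\"a>b\">link</a>")` returns `b">link` instead of `link`: the `>` inside the title attribute ends the first match too early. The output still contains no `<`, which is why it is messy and not dangerous.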

Go download HtmlAgilityPack, now! ;)

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner text of all nodes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .NET libraries out there.

Here is a sample:

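            // resultsStream is assumed to be an open, readable Stream containing the HTML.
            string htmlContents = new System.IO.StreamReader(resultsStream, System.Text.Encoding.UTF8, true).ReadToEnd();

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(htmlContents);

            // Concatenate the text of every top-level node; a StringBuilder
            // avoids repeated string allocations in the loop.
            var builder = new System.Text.StringBuilder();
            foreach (var node in doc.DocumentNode.ChildNodes)
            {
                builder.Append(node.InnerText);
            }
            string output = builder.ToString();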

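// RegexOptions.Singleline lets '.' match newlines, so tags that span lines are removed too.
string plainText = Regex.Replace(htmlText, "<.*?>", string.Empty, RegexOptions.Singleline);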

protected string StripHtml(string Txt)
{
    return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}    

Protected Function StripHtml(Txt as String) as String
    Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

I've posted this on the ASP.NET forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable. In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. Check the output on your framework version, though: as far as I know, InnerText on the server-side HtmlGenericControl is implemented as an HTML decode of InnerHtml, in which case the tags are left in place. See below for a simple C# example:


System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
