 

How can I strip HTML from text in .NET?

I have an ASP.NET web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.

On the server, I would like to strip the HTML from the text so I can store only the plain text in a full-text indexed column for searching.

It's a breeze to strip the HTML on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?

EDIT

See my answer.

EDIT 2

alt text http://tinyurl.com/sillychimp

Ronnie Overby asked Aug 28 '09 19:08



2 Answers

I downloaded the HtmlAgilityPack and created this function:

using System.Text.RegularExpressions;
using System.Web; // HttpUtility (requires a reference to System.Web)

string StripHtml(string html)
{
    // Insert a space after every '>' so that words in adjacent elements do not run together
    html = html.Replace(">", "> ");

    // Parse the HTML
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Take the document's text content and decode HTML entities (e.g. &amp; becomes &)
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

    // Collapse each run of whitespace into a single space and trim both ends
    return Regex.Replace(text, @"\s+", " ").Trim();
}
Ronnie Overby answered Oct 05 '22 23:10


Take a look at this: Strip HTML tags from a string using regular expressions
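
The linked article is not reproduced here, so the following is only a minimal sketch of what a regex-based approach might look like, not the article's code. The class and method names (RegexHtmlStripper, StripHtmlWithRegex) are hypothetical, and the pattern assumes reasonably simple markup such as TinyMCE output with no '>' inside attribute values.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexHtmlStripper
{
    // Hypothetical helper for illustration; not taken from the linked article.
    public static string StripHtmlWithRegex(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Replace anything that looks like a tag with a space,
        // so that words in adjacent elements do not run together
        string withoutTags = Regex.Replace(html, @"<[^>]*>", " ");

        // Decode HTML entities (e.g. &amp; becomes &)
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse each run of whitespace into a single space and trim both ends
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>Hello <b>world</b></p><p>Tom &amp; Jerry</p>";
        Console.WriteLine(StripHtmlWithRegex(html)); // prints "Hello world Tom & Jerry"
    }
}
```

A regex like this does not understand HTML structure: the contents of script and style elements are kept as text, and a '>' inside an attribute value or comment ends the match early. For arbitrary input, a real parser such as HtmlAgilityPack is the safer choice.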

riotera answered Oct 05 '22 23:10