I have an ASP.NET web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to strip the HTML from the text so I can store only the text in a full-text indexed column for searching.
It's a breeze to strip the html on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
See my answer.
strip_tags() is a PHP function that strips all HTML and PHP tags from a given string (parameter one); you can also use parameter two to specify a list of HTML tags you want to keep.
If you want to show your content without any formatting, you can use Regex.Replace(input, "<.*?>", String.Empty) to strip all of the HTML tags from your string.
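A minimal sketch of that regex approach as a self-contained C# helper. The names HtmlStripper and StripTags are hypothetical, and the entity-decoding step is an addition that the one-liner above does not do. It assumes simple, well-formed markup: a regex does not parse HTML, so comments, script bodies, or a ">" inside an attribute value can produce wrong output.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HtmlStripper
{
    public static string StripTags(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // Lazily match anything between '<' and the next '>'.
        // Singleline lets '.' match newlines, so tags that span lines are removed too.
        string withoutTags = Regex.Replace(input, "<.*?>", string.Empty, RegexOptions.Singleline);

        // Assumption: entities such as &amp; should become plain characters for indexing.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, HtmlStripper.StripTags("<p>Hello <b>world</b> &amp; friends</p>") returns "Hello world & friends". Note that adjacent block elements such as "<p>one</p><p>two</p>" come out as "onetwo", because nothing is inserted where the tags were.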
I downloaded the HtmlAgilityPack and created this function:
using System.Text.RegularExpressions; // Regex
using System.Web;                     // HttpUtility

string StripHtml(string html)
{
    // Insert a space after every tag so that words in adjacent elements do not run together
    html = html.Replace(">", "> ");

    // Parse the HTML
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Take the text content of the document and decode HTML entities
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

    // Collapse every run of whitespace into a single space and trim both ends
    return Regex.Replace(text, @"\s+", " ").Trim();
}
Take a look at this: "Strip HTML tags from a string using regular expressions".