
How to remove all tags and get the pure text?

I have to store user-entered text in my database together with its HTML and CSS formatting.

The case is:

I use RadEditor. The user copies text from MS Word into this editor, and I store the text in the database with its formatting. When I later retrieve the data for a report or a label, some tags appear wrapped around the text.

I use a regular expression to remove all the formatting, but it only succeeds some of the time.

private static Regex oClearHtmlScript = new Regex(@"<(.|\n)*?>", RegexOptions.Compiled);

public static string RemoveAllHTMLTags(string sHtml)
{
    sHtml = sHtml.Replace("&nbsp;", string.Empty);
    sHtml = sHtml.Replace("&gt;", ">");
    sHtml = sHtml.Replace("&lt;", "<");
    sHtml = sHtml.Replace("&amp;", "&");
    if (string.IsNullOrEmpty(sHtml))
        return string.Empty;

    return oClearHtmlScript.Replace(sHtml, string.Empty);
}

How can I remove all the formatting using Html Agility Pack, or any other dependable way, to ensure the text is pure?

Note: The data type of this field in the database is Lvarchar.

Anyname Donotcare asked Jan 13 '23 12:01



2 Answers

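This strips simple HTML tags from a string. By default . does not match a newline, so a tag that spans several lines is only removed if you also pass RegexOptions.Singleline.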

sHtml = Regex.Replace(sHtml, "<.*?>", "");
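
A slightly fuller version of the same idea, as a rough sketch only. The class and method names (HtmlStripper, StripTags) are made up for illustration. It assumes you want tags removed first and entities decoded afterwards, so that escaped text such as &lt;b&gt; is kept as literal characters instead of being stripped as a tag. It is still a regex, so it can go wrong on a > inside an attribute value and on script or style content.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlStripper
{
    // Singleline lets '.' match '\n', so tags that wrap across lines are matched too.
    private static readonly Regex TagPattern =
        new Regex("<.*?>", RegexOptions.Singleline | RegexOptions.Compiled);

    public static string StripTags(string html)
    {
        // Guard first, so a null input never reaches the regex.
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Remove the tags first, then decode entities, so that escaped text
        // such as "&lt;b&gt;" survives as the literal characters "<b>".
        string withoutTags = TagPattern.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, HtmlStripper.StripTags("<p\nclass=\"x\">a &amp; b</p>") gives "a & b". Note that HtmlDecode turns &nbsp; into a non-breaking space character, not into an empty string.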
Win answered Jan 16 '23 02:01



Html Agility Pack makes working with HTML easy.

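using HtmlAgilityPack;

HtmlDocument mainDoc = new HtmlDocument();
string htmlString = "<html><body><h1>Test</h1> more text</body></html>";
mainDoc.LoadHtml(htmlString);
// InnerText returns the text of all descendant nodes with the tags removed.
string cleanText = mainDoc.DocumentNode.InnerText;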
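
For text pasted from Word you will probably want a bit more than InnerText. Here is a minimal sketch, assuming the Html Agility Pack NuGet package is referenced. HtmlTextExtractor and ToPlainText are names I made up for illustration. It removes script, style and comment nodes before reading the text, since their contents can otherwise end up in the result (whether comment text is included seems to depend on the library version), and it decodes entities with HtmlEntity.DeEntitize because InnerText leaves them encoded.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack itself.
public static class HtmlTextExtractor
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the unwanted nodes into a list first, so the tree is not
        // modified while it is being enumerated.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment
                     || n.Name == "script"
                     || n.Name == "style")
            .ToList();

        foreach (var node in unwanted)
            node.Remove();

        // InnerText leaves entities such as &amp; encoded; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

You would call it as string cleanText = HtmlTextExtractor.ToPlainText(htmlString); and trim or collapse whitespace afterwards if the report needs it.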
Ty Petrice answered Jan 16 '23 00:01
