
Remove Style tag in HTML

Tags: html, c#, regex

I need to remove all inline style attributes completely from the given HTML. I found the following regex, which matches an entire style attribute in the markup. It works fine on the given HTML in online regex testers.

style\s*=\s*('|")[^\2]*?\2([^>]*)

However, when I run it from C#, it does not match anything in the same HTML.

Here is the C# code:

Regex regex = new Regex("style\\s*=\\s*('|\")[^\\2]*?\\2([^>]*)", RegexOptions.IgnoreCase);
Dimax asked Dec 20 '22 23:12


2 Answers

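The quote is captured by group 1, not group 2, so the backreference should be \1. Inside a character class \1 is not treated as a backreference, so match the value lazily instead of with [^\1]. The regex should be

 style\s*=\s*('|").*?\1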
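
For illustration, here is a minimal sketch of applying a group-1 backreference pattern with Regex.Replace. It assumes C# 9 or later (top-level statements) and that every style value is quoted. The helper name RemoveStyleAttributes and the sample string are made up for this example. The pattern uses a lazy `.*?` so the match stops at the first closing quote of the same kind, and a leading `\s+` so the space before the attribute is removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strips every quoted style="..." or style='...' attribute.
static string RemoveStyleAttributes(string html)
{
    // \s+      whitespace in front of the attribute (removed together with it)
    // ('|")    opening quote, captured as group 1
    // .*?      attribute value, matched lazily
    // \1       the same quote character that opened the value
    return Regex.Replace(
        html,
        @"\s+style\s*=\s*('|"").*?\1",
        "",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

string sample = "<p style=\"color:red\" id='x'>Hi</p><div STYLE='margin:0'>there</div>";
Console.WriteLine(RemoveStyleAttributes(sample));
// prints: <p id='x'>Hi</p><div>there</div>
```

Like any regex over HTML, this will also match text that merely looks like a style attribute, which is one reason a parser is the safer choice.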

Though I would use HtmlAgilityPack rather than a regex:

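   // requires: using HtmlAgilityPack;
   HtmlDocument doc = new HtmlDocument();
   doc.Load(yourStream);
   // select every element that carries a style attribute; SelectNodes returns null when nothing matches
   var elementsWithStyleAttribute = doc.DocumentNode.SelectNodes("//*[@style]");
   if (elementsWithStyleAttribute != null)
   {
       foreach (var element in elementsWithStyleAttribute)
       {
           element.Attributes["style"].Remove();
       }
   }
   doc.Save(yourOutputStream); // Save needs a target; yourOutputStream is a placeholder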
Anirudha answered Dec 31 '22 02:12


I usually use the code below to remove style and script blocks, images, Office (o:) elements, comments and class attributes from an Outlook message before saving it to the database:

// Drop <style> and <script> blocks, including their contents
desc = Regex.Replace(desc, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Drop images, Office-namespace (o:) elements and HTML comments
desc = Regex.Replace(desc, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Drop class attributes; the first pattern also discards whatever follows class= up to the closing >
desc = Regex.Replace(desc, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "class=.+?\\s", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline);
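
Note that the `class=.+?>` pattern replaces everything from `class=` up to the closing `>`, so any attributes that come after class in the same tag are dropped too. If that is not wanted, a narrower pattern is one option. The following is only a sketch, assuming C# 9 or later and that class values are always quoted; RemoveClassAttributes is a hypothetical helper name, not part of the code above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes only the quoted class attribute and keeps the rest of the tag.
static string RemoveClassAttributes(string html)
{
    // Match leading whitespace, class=, then either a double-quoted or a single-quoted value.
    return Regex.Replace(
        html,
        @"\s+class\s*=\s*(""[^""]*""|'[^']*')",
        "",
        RegexOptions.IgnoreCase);
}

string sampleDesc = "<p class=\"MsoNormal\" id=\"intro\">Hello</p>";
Console.WriteLine(RemoveClassAttributes(sampleDesc));
// prints: <p id="intro">Hello</p>
```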
ZooZ answered Dec 31 '22 03:12