remove all inline styles and (most) classes from an HTML string

Question

I'll start from the end:
In my C# program, I have a string containing HTML. I'd like to remove from the elements in this string all inline style attributes (style="..") and all classes beginning with 'abc'.
I'm willing to use regular expressions for this, even though some people complain about it :).

(An explanation, for those wishing to berate me for parsing HTML strings:
I'm forced to use a less-than-optimal web control for my project. The control is designed to be used server-side (i.e., with postbacks and all that stuff), but I'm required to use it in AJAX calls.
That means I have to configure it in code, call its Render() method, which gives me the HTML string, and pass that string to the client side, where it's inserted into the DOM at the appropriate place. Unfortunately, I couldn't find a configuration of the control that stops it from rendering itself with these useless styles and classes, so I'm forced to remove them by hand. Please don't hate me.)

Bohemian · Accepted Answer

Try this:

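// Requires: using System.Text.RegularExpressions;
// html holds the string returned by the control's Render() call
string cleaned = Regex.Replace(html, "style=\"[^\"]*\"", "");
// Verbatim string, so \b and \w reach the regex engine instead of being treated as C# escapes
cleaned = Regex.Replace(cleaned, @"(?<=class="")([^""]*)\babc\w*\b([^""]*)(?="")", "$1$2");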
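One thing to watch with the class pattern above: the match is anchored right after class=", so a single Replace pass removes at most one abc-prefixed class per attribute, and \w stops at a hyphen. A possible workaround, sketched below, is to match the whole attribute and filter its class names in a MatchEvaluator. The HtmlCleaner and Clean names are made up for illustration, and the sketch assumes lowercase, double-quoted attributes preceded by whitespace, which may not hold for every control.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Assumption: attributes are lowercase, double-quoted, and preceded by whitespace.
    // Matches an inline style attribute together with the whitespace before it.
    private static readonly Regex StyleAttribute =
        new Regex(@"\s+style=""[^""]*""");

    // Matches a class attribute and captures its value in group 1.
    private static readonly Regex ClassAttribute =
        new Regex(@"(?<=\s)class=""([^""]*)""");

    public static string Clean(string html, string prefix = "abc")
    {
        string withoutStyles = StyleAttribute.Replace(html, "");

        // Rebuild every class attribute from the names that do not start with the prefix.
        return ClassAttribute.Replace(withoutStyles, match =>
        {
            var kept = match.Groups[1].Value
                .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                .Where(name => !name.StartsWith(prefix, StringComparison.Ordinal));
            return "class=\"" + string.Join(" ", kept) + "\"";
        });
    }
}
```

Calling HtmlCleaner.Clean(html) on the rendered string should drop every abc-prefixed class in one pass. Like any regex approach, it will also touch text content that happens to look like an attribute, so it is only a sketch for markup you control.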

J. Ed · Answer

To anyone interested: I've solved this without using regex.
Instead, I used XDocument to parse the HTML:

// Requires: using System.Linq; using System.Xml.Linq;
private string MakeHtmlGood(string html)
{
    // XDocument.Parse throws unless html is well-formed XML with a single root element
    var xmlDoc = XDocument.Parse(html);

    // Remove all inline styles
    xmlDoc.Descendants().Attributes("style").Remove();

    // Remove all classes inserted by 3rd party, without removing our own lovely classes
    foreach (var node in xmlDoc.Descendants())
    {
        var classAttribute = node.Attribute("class");
        if (classAttribute == null)
        {
            continue;
        }
        var classesThatShouldStay = classAttribute.Value.Split(' ').Where(className => !className.StartsWith("abc"));
        classAttribute.SetValue(string.Join(" ", classesThatShouldStay));
    }

    return xmlDoc.ToString();
}
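A caveat with the XDocument route: XDocument.Parse only accepts well-formed XML, so markup with an unclosed <br> or a bare &nbsp; will throw, and ToString() re-indents the output by default. The sketch below is one way to guard against both. XmlHtmlCleaner and TryClean are hypothetical names, and dropping an emptied class attribute is my own choice rather than part of the answer above.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlHtmlCleaner
{
    // Hypothetical helper: returns false (and hands back the input unchanged)
    // when the markup is not well-formed XML, instead of throwing.
    public static bool TryClean(string html, out string cleaned, string prefix = "abc")
    {
        XDocument doc;
        try
        {
            // PreserveWhitespace keeps the original spacing between elements.
            doc = XDocument.Parse(html, LoadOptions.PreserveWhitespace);
        }
        catch (XmlException)
        {
            cleaned = html;
            return false;
        }

        // Remove all inline styles.
        doc.Descendants().Attributes("style").Remove();

        // ToList() takes a snapshot so attributes can be removed while looping.
        foreach (var attr in doc.Descendants().Attributes("class").ToList())
        {
            var kept = attr.Value
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Where(name => !name.StartsWith(prefix, StringComparison.Ordinal))
                .ToList();

            if (kept.Count == 0)
                attr.Remove();                        // nothing left, drop the attribute
            else
                attr.Value = string.Join(" ", kept);
        }

        // DisableFormatting stops the serializer from re-indenting the markup.
        cleaned = doc.ToString(SaveOptions.DisableFormatting);
        return true;
    }
}
```

If TryClean returns false, the caller can fall back to another approach (or log the offending markup) rather than failing the AJAX call.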

Tags: html, c#, regex, parsing, html-parsing
