I've seen a few related questions out here, but they don’t exactly talk about the same problem I am facing.
I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags.
So for instance, in my scenario, I would like to preserve the tags "b
", "i
" and "u
".
And for an input like:
<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>
The resulting HTML should be:
my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>
I tried using HtmlNode
's Remove
method, but it removes my content too. Any suggestions?
The HTML tags can be removed from a given string by using replaceAll() method of String class. We can remove the HTML tags from a given string by using a regular expression. After removing the HTML tags from a string, it will return a string as normal text.
For users who are unafamiliar with “HTML Agility Pack“, this is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT. In simple words, it is a . NET code library that allows you to parse “out of the web” files (be it HTML, PHP or aspx).
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a . NET code library that allows you to parse "out of the web" HTML files.
I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.
It removes all tags except strong
, em
, u
and raw text nodes.
internal static string RemoveUnwantedTags(string data) { if(string.IsNullOrEmpty(data)) return string.Empty; var document = new HtmlDocument(); document.LoadHtml(data); var acceptableTags = new String[] { "strong", "em", "u"}; var nodes = new Queue<HtmlNode>(document.DocumentNode.SelectNodes("./*|./text()")); while(nodes.Count > 0) { var node = nodes.Dequeue(); var parentNode = node.ParentNode; if(!acceptableTags.Contains(node.Name) && node.Name != "#text") { var childNodes = node.SelectNodes("./*|./text()"); if (childNodes != null) { foreach (var child in childNodes) { nodes.Enqueue(child); parentNode.InsertBefore(child, node); } } parentNode.RemoveChild(node); } } return document.DocumentNode.InnerHtml; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With