
HTML Agility Pack - removing unwanted tags without removing content?

I've seen a few related questions on here, but none of them covers exactly the problem I am facing.

I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags.

So for instance, in my scenario, I would like to preserve the tags "b", "i" and "u".

And for an input like:

<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>

The resulting HTML should be:

my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>

I tried using HtmlNode's Remove method, but it removes my content too. Any suggestions?

Mathias Lykkegaard Lorenzen asked Oct 08 '12 18:10




1 Answer

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.

It removes all tags except strong, em, u and raw text nodes.

    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    internal static string RemoveUnwantedTags(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(data);

        var acceptableTags = new String[] { "strong", "em", "u" };

        // Start with the top-level elements and text nodes.
        var nodes = new Queue<HtmlNode>(document.DocumentNode.SelectNodes("./*|./text()"));
        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
            {
                var childNodes = node.SelectNodes("./*|./text()");

                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        // Queue the child for inspection and move it in front of the unwanted tag.
                        nodes.Enqueue(child);
                        parentNode.InsertBefore(child, node);
                    }
                }

                // The children have been moved out, so the tag itself can go.
                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }
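One case the queue-based version above does not cover: when a kept tag such as strong wraps an unwanted one, the loop never enqueues the kept tag's children, so the inner tag survives. Below is a minimal sketch of a variant that checks every element in the document. It is not the original answer's code. The class name HtmlSanitizerSketch, the method name UnwrapUnwantedTags and the b/i/u allow-list (taken from the question) are assumptions for illustration. It relies only on Descendants(), ChildNodes, InsertBefore and RemoveChild from HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

internal static class HtmlSanitizerSketch
{
    // Hypothetical allow-list, mirroring the b/i/u example from the question.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "i", "u" };

    internal static string UnwrapUnwantedTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Snapshot every disallowed element before touching the tree, in reverse
        // document order so nested elements are unwrapped before their ancestors.
        var unwanted = document.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !AllowedTags.Contains(n.Name))
            .Reverse()
            .ToList();

        foreach (var node in unwanted)
        {
            var parent = node.ParentNode;

            // Copy the child list first, because moving nodes changes the collection.
            foreach (var child in node.ChildNodes.ToList())
            {
                // Move each child (element, text or comment) in front of the unwanted tag.
                parent.InsertBefore(child, node);
            }

            // The tag is now empty of content we care about, so remove it.
            parent.RemoveChild(node);
        }

        return document.DocumentNode.InnerHtml;
    }
}
```

Calling HtmlSanitizerSketch.UnwrapUnwantedTags(html) returns the cleaned markup. Unlike the XPath used above, this sketch also keeps comments that sat inside a removed tag, so add a NodeType check if those should be dropped as well.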
Mathias Lykkegaard Lorenzen answered Oct 11 '22 19:10