Stripping all html tags with Html Agility Pack

Tags:

I have a html string like this:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

I wish to strip all html tags so that the resulting string becomes:

foo bar baz

From another post here at SO I've come up with this function (which uses the Html Agility Pack):

  Public Shared Function stripTags(ByVal html As String) As String
    Dim plain As String = String.Empty
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument

    htmldoc.LoadHtml(html)
    Dim invalidNodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//html|//body|//p|//a")

    If Not htmldoc Is Nothing Then
      For Each node In invalidNodes
        node.ParentNode.RemoveChild(node, True)
      Next
    End If

    Return htmldoc.DocumentNode.WriteContentTo
  End Function

Unfortunately this does not return what I expect, instead it gives:

bazbarfoo

Please, where do I go wrong - and is this the best approach?

Regards and happy coding!

UPDATE: by the answer below I came up with this function, might be usefull to others:

  Public Shared Function stripTags(ByVal html As String) As String
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(html.Replace("</p>", "</p>" & New String(Environment.NewLine, 2)).Replace("<br/>", Environment.NewLine))
    Return htmldoc.DocumentNode.InnerText
  End Function

932

asked Jun 29 '10 13:06

Muleskinner

2 Answers

Why not just return htmldoc.DocumentNode.InnerText instead of removing all the non-text nodes? It should give you what you want.

184

answered Oct 08 '22 02:10

tvanfosson

It removes the tags and properties not found in the whitelist.

Public NotInheritable Class HtmlSanitizer
    Private Sub New()
    End Sub
    Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
    Private Shared DeletableNodesXpath As New List(Of String)()

    Shared Sub New()
        Whitelist = New Dictionary(Of String, String())() From { _
            {"a", New () {"href"}}, _
            {"strong", Nothing}, _
            {"em", Nothing}, _
            {"blockquote", Nothing}, _
            {"b", Nothing}, _
            {"p", Nothing}, _
            {"ul", Nothing}, _
            {"ol", Nothing}, _
            {"li", Nothing}, _
            {"div", New () {"align"}}, _
            {"strike", Nothing}, _
            {"u", Nothing}, _
            {"sub", Nothing}, _
            {"sup", Nothing}, _
            {"table", Nothing}, _
            {"tr", Nothing}, _
            {"td", Nothing}, _
            {"th", Nothing} _
        }
    End Sub

    Public Shared Function Sanitize(input As String) As String
        If input.Trim().Length < 1 Then
            Return String.Empty
        End If
        Dim htmlDocument = New HtmlDocument()

        htmlDocument.LoadHtml(input)
        SanitizeNode(htmlDocument.DocumentNode)
        Dim xPath As String = HtmlSanitizer.CreateXPath()

        Return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath)
    End Function

    Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
        For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
            SanitizeNode(parentNode.ChildNodes(i))
        Next
    End Sub

    Private Shared Sub SanitizeNode(node As HtmlNode)
        If node.NodeType = HtmlNodeType.Element Then
            If Not Whitelist.ContainsKey(node.Name) Then
                If Not DeletableNodesXpath.Contains(node.Name) Then
                    'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                    node.Name = "removeableNode"
                    DeletableNodesXpath.Add(node.Name)
                End If
                If node.HasChildNodes Then
                    SanitizeChildren(node)
                End If

                Return
            End If

            If node.HasAttributes Then
                For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                    Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                    Dim allowedAttributes As String() = Whitelist(node.Name)
                    If allowedAttributes IsNot Nothing Then
                        If Not allowedAttributes.Contains(currentAttribute.Name) Then
                            node.Attributes.Remove(currentAttribute)
                        End If
                    Else
                        node.Attributes.Remove(currentAttribute)
                    End If
                Next
            End If
        End If

        If node.HasChildNodes Then
            SanitizeChildren(node)
        End If
    End Sub

    Private Shared Function StripHtml(html As String, xPath As String) As String
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)
        If xPath.Length > 0 Then
            Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
            For Each node As HtmlNode In invalidNodes
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
        Return htmlDoc.DocumentNode.WriteContentTo()


    End Function

    Private Shared Function CreateXPath() As String
        Dim _xPath As String = String.Empty
        For i As Integer = 0 To DeletableNodesXpath.Count - 1
            If i IsNot DeletableNodesXpath.Count - 1 Then
                _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
            Else
                _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
            End If
        Next
        Return _xPath
    End Function
End Class

answered Oct 08 '22 04:10

Dragos Durlut

Related questions
                            
                                Is it too early to use HTML5
                            
                                Including hidden data in an HTML page for javascript to process
                            
                                HTML5 <script> declarations
                            
                                Html5 drag and drop on svg element
                            
                                setting minimum width of div
                            
                                Video file .ogv plays locally in Firefox, but not from server
                            
                                HTML5, how to force input pattern validation on value change?
                            
                                Show a second dropdown based on previous dropdown selection
                            
                                how to append an element between two elements
                            
                                How to convert HTML with mathjax into Latex using Pandoc?
                            
                                Html over the Canvas?
                            
                                Disable/Non-Clickable an HTML button in Javascript
                            
                                AppendChild() is not a function javascript
                            
                                Select text in ::before or ::after pseudo-element
                            
                                Understand the Find() function in Beautiful Soup
                            
                                Scrape title by only downloading relevant part of webpage
                            
                                What is the difference between margin-block-start and margin-top?
                            
                                Simple HTML layout engine to convert HTML to an image
                            
                                Capturing HTML generated from ASP.NET
                            
                                How to apply 100% height to div?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Stripping all html tags with Html Agility Pack

Tags:

html

vb.net

strip

html-agility-pack

Muleskinner

People also ask

2 Answers

tvanfosson

Dragos Durlut

Recent Activity

Donate For Us