 

How to replace all HTML tags with an empty string in Go

Tags:

go

I'm trying to replace all HTML tags, such as <div> and </div>, with an empty string ("") in Go, using the regex pattern ^[^.\/]*$/g to match closing tags, e.g. </div>.

My solution:

package main

import (
    "fmt"
    "regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
    r := regexp.MustCompile(Template)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := r.ReplaceAllString(s, "")
    fmt.Println(res)
}

But the output is the same as the source string. What's wrong? Please help. Thanks.

Expected result: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"

Loint asked Mar 07 '19 04:03


3 Answers

For those who came here looking for a quick solution, there is a library that does this: bluemonday.

Package bluemonday provides a way of describing a whitelist of HTML elements and attributes as a policy, and for that policy to be applied to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist will be stripped.

package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Do this once for each unique policy, and use the policy for the life of the program
    // Policy creation/editing is not safe to use in multiple goroutines
    p := bluemonday.StripTagsPolicy()

    // The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
    html := p.Sanitize(
        `<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
    )

    // Output:
    // Google
    fmt.Println(html)
}

https://play.golang.org/p/jYARzNwPToZ

Bill Zelenko answered Dec 12 '22 14:12


The Problem with RegEx

This is a very simple RegEx replace method that removes HTML tags from well-formatted HTML in a string.

strip_html_regex.go

package main

import "regexp"

const regex = `<.*?>`

// stripHtmlRegex uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
    r := regexp.MustCompile(regex)
    return r.ReplaceAllString(s, "")
}

Note: this does not work well with malformed HTML. Don't use this.
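For context, here is a small sketch of why the pattern from the question leaves the string unchanged, compared with the `<.*?>` pattern above. Go's regexp package has no `/g` suffix, so those two characters are treated as literal text; and since nothing can follow the end-of-text anchor `$`, the pattern can never match. The helper names stripWithQuestionPattern and stripWithTagPattern are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from the question. The trailing "/g" is literal text in Go,
// and it comes after "$" (end of text), so this pattern never matches.
var questionPattern = regexp.MustCompile(`^[^.\/]*$/g`)

// Pattern from strip_html_regex.go: a non-greedy match of "<...>".
var tagPattern = regexp.MustCompile(`<.*?>`)

// stripWithQuestionPattern is a hypothetical helper for this illustration.
func stripWithQuestionPattern(s string) string {
	return questionPattern.ReplaceAllString(s, "")
}

// stripWithTagPattern is a hypothetical helper for this illustration.
func stripWithTagPattern(s string) string {
	return tagPattern.ReplaceAllString(s, "")
}

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	fmt.Println(questionPattern.MatchString(s)) // false
	fmt.Println(stripWithQuestionPattern(s))    // unchanged input
	fmt.Println(stripWithTagPattern(s))         // afsdf4534534!@@!!#345345afsdf4534534!@@!!#
}
```

ReplaceAllString already replaces every match, so no "global" flag is needed. This only covers the well-formed case from the question.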

A better way

Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the portions that are not inside an HTML tag. When we identify a valid portion of the string, we can simply take a slice of that portion and append it using a strings.Builder.

strip_html.go

package main

import (
    "strings"
    "unicode/utf8"
)

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
    // Setup a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`
    end := 0    // The index of the previous end tag character `>`

    for i, c := range s {
        // Keep going if the character is not `<` or `>`
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>` not just `<br>`
            if !in {
                start = i
                // Write the valid string between the close and start of the two tags.
                // Writing only here avoids appending the same text twice for `<<`.
                builder.WriteString(s[end:start])
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }

    // Save the text after the last `>`, unless the string ends inside a tag.
    // Doing this after the loop also covers a multi-byte final character
    // and inputs such as `<br>` that consist of a single tag.
    if !in {
        builder.WriteString(s[end:])
    }
    return builder.String()
}

If we run both functions on the OP's text and on some malformed HTML, you will see that their results are not consistent with each other.

main.go

package main

import "fmt"

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := stripHtmlTags(s)
    fmt.Println(res)

    // Malformed HTML examples
    fmt.Println("\n:: stripHTMLTags ::\n")

    fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
    fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))
    
    // Regex Malformed HTML examples
    fmt.Println(":: stripHtmlRegex ::\n")

    fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
    fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}

Output:

afsdf4534534!@@!!#345345afsdf4534534!@@!!#

:: stripHTMLTags ::

Do something bold.
I broke this
This is broken link.
start this tag

:: stripHtmlRegex ::

Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.

Note that the RegEx method does not remove all HTML tags consistently. To be honest, I am not good enough at RegEx to write a pattern that properly handles stripping HTML.

Benchmarks

Aside from being safer and more aggressive in stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.

> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8          51516             22726 ns/op
BenchmarkStripHtmlTags-8          230678              5135 ns/op
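
The benchmark source is not shown above. One thing that may contribute to the regex timing is that stripHtmlRegex compiles the pattern on every call. The sketch below is one way to measure that on your own machine with testing.Benchmark. The input string and the function names stripHTMLRegexPerCall and stripHTMLRegexOnce are assumptions made for this illustration, not the original benchmark.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// tagPattern is compiled once at package level instead of on every call.
var tagPattern = regexp.MustCompile(`<.*?>`)

// stripHTMLRegexPerCall mirrors stripHtmlRegex: it compiles the pattern per call.
func stripHTMLRegexPerCall(s string) string {
	return regexp.MustCompile(`<.*?>`).ReplaceAllString(s, "")
}

// stripHTMLRegexOnce reuses the precompiled pattern.
func stripHTMLRegexOnce(s string) string {
	return tagPattern.ReplaceAllString(s, "")
}

func main() {
	// Needed when calling testing.Benchmark outside of "go test".
	testing.Init()

	// Hypothetical input; the input used for the numbers above is not shown.
	input := strings.Repeat("<p>Do something <strong>bold</strong>.</p>", 20)

	perCall := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			stripHTMLRegexPerCall(input)
		}
	})
	once := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			stripHTMLRegexOnce(input)
		}
	})

	fmt.Println("compile per call:", perCall)
	fmt.Println("compile once:    ", once)
}
```

Results will vary by machine and input, so treat any numbers from this as indicative only.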

Daniel Morell answered Dec 12 '22 12:12


If you want to remove all HTML tags, use a library that strips HTML tags.

Using a regex to match HTML tags is not a good idea.

package main

import (
    "fmt"
    strip "github.com/grokify/html-strip-tags-go" // imported under the name strip
)

func main() {
    text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    stripped := strip.StripTags(text)

    fmt.Println(text)
    fmt.Println(stripped)
}

sh.seo answered Dec 12 '22 12:12