
golang HTML charset decoding

Tags:

go

I'm trying to decode HTML pages that are NOT utf-8 encoded.

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Is there any library that can do that? I couldn't find one online.

P.S. Of course, I can extract the charset and decode the HTML page with goquery and iconv-go, but I'm trying not to reinvent the wheel.

asked Mar 25 '26 13:03 by daisy


1 Answer

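The Go project maintains extension packages for this: golang.org/x/net/html/charset and golang.org/x/text/encoding. If you only need a UTF-8 stream, charset.NewReader sniffs the encoding and decodes in a single call.

The code below detects the encoding (when none is given) and decodes the body to UTF-8 so that the html package can parse the document correctly: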

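import (
    "bufio"
    "io"

    "golang.org/x/net/html"
    "golang.org/x/net/html/charset"
    "golang.org/x/text/encoding/htmlindex"
)

// detectContentCharset sniffs up to the first 1024 bytes of r without
// consuming them and returns the name of the detected encoding.
func detectContentCharset(r *bufio.Reader) string {
    // Peek still returns the bytes it could read when the document is
    // shorter than 1024 bytes, so the error is deliberately ignored.
    data, _ := r.Peek(1024)
    if _, name, ok := charset.DetermineEncoding(data, ""); ok {
        return name
    }
    return "utf-8"
}

// Decode parses the HTML body using the specified encoding and
// returns the parsed document. If label is empty, the encoding is
// detected from the content.
func Decode(body io.Reader, label string) (*html.Node, error) {
    // Buffer the body once so that sniffing does not swallow the
    // bytes the parser needs afterwards.
    br := bufio.NewReader(body)
    if label == "" {
        label = detectContentCharset(br)
    }
    e, err := htmlindex.Get(label)
    if err != nil {
        return nil, err
    }

    var r io.Reader = br
    if name, _ := htmlindex.Name(e); name != "utf-8" {
        // Wrap the buffered reader so the parser receives UTF-8.
        r = e.NewDecoder().Reader(br)
    }
    return html.Parse(r)
}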
answered Mar 28 '26 19:03 by zhengchun


