 

Convert any encoding to UTF-8 in Go

I'm downloading messages via IMAP and then storing the parsed messages in MongoDB. The problem is that MongoDB supports only UTF-8, while the messages arrive in various encodings. How can I convert each string to UTF-8?

I know that I could store the data as binary, but I need normal text because I have to search for phrases in the database. Or is it possible to search for normal text inside binary data?

JakubKubera asked Dec 04 '14 15:12



2 Answers

I'm using the go-charset project to do this: https://code.google.com/p/go-charset/

It's pretty straightforward: you create a reader for a given charset, and it translates to UTF-8 as you read from it. Here is an example from the library:

// go-charset's NewReader takes the charset name first, then the source reader.
r, err := charset.NewReader("latin1", strings.NewReader("\xa35 for Pepp\xe9"))
if err != nil {
    log.Fatal(err)
}
result, err := ioutil.ReadAll(r)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s\n", result)  //outputs £5 for Peppé
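
To make concrete what such a reader does, here is a minimal standard-library sketch for the Latin-1 case only, assuming "latin1" means ISO-8859-1. The helper name latin1ToUTF8 is made up for illustration and is not part of go-charset. It relies on the fact that every ISO-8859-1 byte has the same numeric value as its Unicode code point:

```go
package main

import "fmt"

// latin1ToUTF8 is a hypothetical helper (not part of any library).
// It converts ISO-8859-1 bytes to a UTF-8 string by widening each byte
// to the rune with the same numeric value.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	// Converting []rune to string encodes the runes as UTF-8.
	return string(runes)
}

func main() {
	fmt.Println(latin1ToUTF8([]byte("\xa35 for Pepp\xe9"))) // £5 for Peppé
}
```

Other encodings need real mapping tables, which is what the charset libraries provide.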

Now, in my case I know the charset because it comes from web pages and I read the headers/meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet

I haven't used it but it also looks pretty simple to use:

detector := chardet.NewTextDetector()
result, err := detector.DetectBest(some_text)
if err == nil {
    fmt.Printf(
        "Detected charset is %s, language is %s",
        result.Charset,
        result.Language)
}
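
For mail fetched over IMAP, as in the question, the charset is usually declared in each MIME part's Content-Type header, so detection may only be needed as a fallback. A rough sketch of reading that parameter with the standard mime package follows; the helper name charsetFromContentType and the "us-ascii" fallback are my own choices for illustration:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper. It returns the lower-cased
// charset parameter of a Content-Type header value, or fallback when the
// header cannot be parsed or carries no charset parameter.
func charsetFromContentType(contentType, fallback string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fallback
	}
	if cs, ok := params["charset"]; ok && cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(charsetFromContentType(`text/plain; charset="ISO-8859-2"`, "us-ascii")) // iso-8859-2
	fmt.Println(charsetFromContentType("text/plain", "us-ascii"))                       // us-ascii
}
```

The returned name can then be handed to whichever conversion library you use. Declared charsets in real mail are sometimes wrong, which is where a detector like the one above could still help.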
Not_a_Golfer answered Sep 21 '22 17:09


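charset.NewReader in package golang.org/x/net/html/charset can't deal with the gb2312 encoding when you pass it the bare encoding name, most likely because its second argument is a Content-Type value (for example "text/html; charset=gb2312") rather than an encoding label. charset.NewReaderLabel takes the label directly and can deal with it.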

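import (
    "bytes"
    "io/ioutil"

    "golang.org/x/net/html/charset"
)

// convertToUTF8 decodes str from origEncoding (a label such as "gb2312") into UTF-8.
func convertToUTF8(str string, origEncoding string) (string, error) {
    // NewReaderLabel returns an error for labels it does not recognize.
    reader, err := charset.NewReaderLabel(origEncoding, bytes.NewReader([]byte(str)))
    if err != nil {
        return "", err
    }
    utf8Bytes, err := ioutil.ReadAll(reader)
    if err != nil {
        return "", err
    }
    return string(utf8Bytes), nil
}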
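
Since the declared encoding of a message can be wrong, it may be worth checking the result before inserting it into MongoDB. This is only a sketch using the standard library; the helper name sanitizeUTF8 and the choice of U+FFFD as the replacement are assumptions on my part, not something the question requires:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 is a hypothetical helper. It returns s unchanged when it is
// already valid UTF-8; otherwise it replaces each run of invalid bytes with
// a single U+FFFD replacement character so the result is always valid UTF-8.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Println(sanitizeUTF8("Peppé"))                         // already valid, unchanged
	fmt.Println(utf8.ValidString(sanitizeUTF8("Pepp\xe9"))) // true
}
```

Replacing invalid bytes loses information, so this fits best as a last resort after a proper conversion has been attempted.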
Liu Zhihui answered Sep 22 '22 17:09