I'm downloading messages via IMAP and then adding the parsed messages to MongoDB. My problem is that MongoDB supports only UTF-8, so I want to convert text in any encoding to UTF-8. The source encodings vary. How can I convert each string to UTF-8?
I know that I could store the data as binary, but I need normal text because I have to search for phrases in the database. Or is it possible to search for normal text in binary data?
Different programming languages use different character encodings internally. For example, Java strings are sequences of 16-bit UTF-16 code units, while Go uses UTF-8. Both are variable-width encodings: a single character may take more than one code unit.
In short, Go source code is UTF-8, so a string literal in the source is stored as UTF-8 text. A Go string can technically hold arbitrary bytes, but by convention its contents are UTF-8.
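The difference between a UTF-8 string and raw bytes in another encoding can be checked with the standard library. The following is a minimal sketch; `describe` is a hypothetical helper name, not something from the question or the answers.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length, the rune count, and the UTF-8
// validity of s. (Hypothetical helper, for illustration only.)
func describe(s string) (byteLen int, runeCount int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	// A UTF-8 literal: "é" occupies two bytes (0xC3 0xA9).
	b, r, ok := describe("Peppé")
	fmt.Println(b, r, ok)

	// The same word as raw Latin-1 bytes: a lone 0xE9 is not valid UTF-8.
	b, r, ok = describe("Pepp\xe9")
	fmt.Println(b, r, ok)
}
```

A string that fails `utf8.ValidString` is a likely candidate for conversion before it is stored.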
I'm using the go-charset project to do this: https://code.google.com/p/go-charset/
It's pretty straightforward: you create a reader for a given charset and it translates to UTF-8 automatically. Example from the library:
    // The charset name comes first, then the reader holding the encoded bytes.
    r, err := charset.NewReader("latin1", strings.NewReader("\xa35 for Pepp\xe9"))
    if err != nil {
        log.Fatal(err)
    }
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s\n", result) // outputs £5 for Peppé
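To illustrate what such a conversion does in the simplest case, here is a standard-library-only sketch for ISO-8859-1 (Latin-1). It is not how go-charset is implemented; it relies only on the fact that every Latin-1 byte value equals its Unicode code point. `latin1ToUTF8` is a hypothetical name, and other encodings need real conversion tables from a library.

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 string.
// Each Latin-1 byte value is also its Unicode code point, so every
// byte can be written out as a rune; WriteRune emits the UTF-8 form.
func latin1ToUTF8(in []byte) string {
	var sb strings.Builder
	sb.Grow(len(in))
	for _, b := range in {
		sb.WriteRune(rune(b))
	}
	return sb.String()
}

func main() {
	fmt.Println(latin1ToUTF8([]byte("\xa35 for Pepp\xe9"))) // £5 for Peppé
}
```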
Now, in my case I know the charset because it comes from web pages and I read the headers/meta tags. If you need to detect the charset automatically by heuristics, you'll need another library for that, such as this one: https://github.com/saintfish/chardet
I haven't used it but it also looks pretty simple to use:
    detector := chardet.NewTextDetector()
    result, err := detector.DetectBest(some_text)
    if err == nil {
        fmt.Printf(
            "Detected charset is %s, language is %s",
            result.Charset,
            result.Language)
    }
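For mail fetched over IMAP, the charset is usually declared in each part's Content-Type header, so heuristics may only be needed as a fallback. A minimal sketch of reading that parameter with the standard `mime` package follows; `charsetFromContentType` is a hypothetical helper, and the header value is made up for illustration.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType extracts the charset parameter from a
// Content-Type header value and lowercases it. It returns "" when the
// header is malformed or declares no charset.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	// Example header value (an assumption, not taken from a real message).
	fmt.Println(charsetFromContentType(`text/plain; charset="ISO-8859-2"`)) // iso-8859-2
}
```

The returned label can then be passed to whichever conversion library is in use; an empty result is the case where detection would come into play.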
charset.NewReader in package golang.org/x/net/html/charset can't deal with the gb2312 encoding. charset.NewReaderLabel can deal with it.
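    import (
        "bytes"
        "io"

        "golang.org/x/net/html/charset"
    )

    // convertToUTF8 decodes str from origEncoding (a label such as "gb2312") into UTF-8.
    func convertToUTF8(str string, origEncoding string) (string, error) {
        // NewReaderLabel returns an error for labels it does not recognize.
        reader, err := charset.NewReaderLabel(origEncoding, bytes.NewReader([]byte(str)))
        if err != nil {
            return "", err
        }
        utf8Bytes, err := io.ReadAll(reader)
        if err != nil {
            return "", err
        }
        return string(utf8Bytes), nil
    }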
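When the charset is unknown or the conversion fails, one possible safety net before inserting into MongoDB is to replace invalid byte sequences so the stored string is at least valid UTF-8. This is a lossy fallback, not a conversion, and it is not part of the answer above; `ensureUTF8` is a hypothetical name. It assumes Go 1.13 or later for `strings.ToValidUTF8`.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// ensureUTF8 returns s unchanged when it is already valid UTF-8.
// Otherwise each run of invalid bytes is replaced with U+FFFD,
// the Unicode replacement character. The original bytes are lost.
func ensureUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Println(ensureUTF8("Peppé"))    // already valid, returned as-is
	fmt.Println(ensureUTF8("Pepp\xe9")) // stray 0xE9 becomes U+FFFD
}
```

Text sanitized this way remains searchable for its valid parts, but the replaced characters cannot be recovered, so a proper charset conversion is preferable whenever the encoding is known.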