I am reading Go Essentials:
String in Go is an immutable sequence of bytes (8-bit byte values). This is different than languages like Python, C#, Java or Swift where strings are Unicode.
I am playing around with the following code:
s := "日本語"
b := []byte{0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e}
fmt.Println(string(b) == s) // true
for i, runeChar := range b {
	fmt.Printf("byte position %d: %#U\n", i, runeChar)
}
//byte position 0: U+00E6 'æ'
//byte position 1: U+0097
//byte position 2: U+00A5 '¥'
//byte position 3: U+00E6 'æ'
//byte position 4: U+009C
//byte position 5: U+00AC '¬'
//byte position 6: U+00E8 'è'
//byte position 7: U+00AA 'ª'
//byte position 8: U+009E
for i, runeChar := range string(b) {
	fmt.Printf("byte position %d: %#U\n", i, runeChar)
}
//byte position 0: U+65E5 '日'
//byte position 3: U+672C '本'
//byte position 6: U+8A9E '語'
Questions:
Where does Go get the Unicode code points from when a byte slice is converted to a string? How are the runes formed? Does the Go compiler take the encoding from the source file's text encoding during compilation?
What are the advantages and disadvantages of implementing a string as a byte sequence, instead of as an array of UTF-16 chars like in Java?
You are quoting from a weak, unreliable source: Go Essentials: Strings. Amongst other things, there is no mention of Unicode codepoints or UTF-8 encoding.
For example,
package main

import "fmt"

func main() {
	s := "日本語"
	fmt.Printf("Glyph: %q\n", s)
	fmt.Printf("UTF-8: [% x]\n", []byte(s))
	fmt.Printf("Unicode codepoint: %U\n", []rune(s))
}
Playground: https://play.golang.org/p/iaYd80Ocitg
Output:
Glyph: "日本語"
UTF-8: [e6 97 a5 e6 9c ac e8 aa 9e]
Unicode codepoint: [U+65E5 U+672C U+8A9E]
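To see where the runes come from: Go source files are defined to be UTF-8, so the literal "日本語" is stored as exactly those nine bytes, and string(b) only copies bytes without interpreting them. The code points appear when something decodes the UTF-8, which is what a range loop over a string does. Here is a rough sketch of that decoding done by hand with the standard unicode/utf8 package. The helper name decodeRunes is my own, for illustration only; it is not something the language or library provides.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRunes is a hypothetical helper (not part of any library).
// It walks s roughly the way a range loop over a string does:
// decode one UTF-8 sequence, then advance by its width in bytes.
func decodeRunes(s string) []rune {
	var runes []rune
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		runes = append(runes, r)
		i += size
	}
	return runes
}

func main() {
	s := "日本語"

	fmt.Println(len(s))                    // length in bytes: 9
	fmt.Println(utf8.RuneCountInString(s)) // length in code points: 3

	// Same walk as decodeRunes, printing the byte offset of each rune.
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("byte position %d: %#U (%d bytes)\n", i, r, size)
		i += size
	}

	// Converting the decoded runes back re-encodes them as UTF-8.
	fmt.Println(string(decodeRunes(s)) == s)
}
```

Ranging over a []byte does no decoding at all, which is why the first loop in the question prints each byte as if it were its own code point.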
References:
The Go Blog: Strings, bytes, runes and characters in Go
The Go Programming Language Specification
Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
The Unicode Consortium