There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?
You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
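As a minimal sketch of that check: the helper name toStringChecked is made up for illustration, while utf8.Valid and utf8.ValidString come from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toStringChecked is a hypothetical helper: it converts b to a string and
// reports whether b held valid UTF-8. The conversion itself never fails.
func toStringChecked(b []byte) (string, bool) {
	return string(b), utf8.Valid(b)
}

func main() {
	good := []byte("héllo")
	bad := []byte{'h', 0xff, 'i'} // 0xff can never appear in valid UTF-8

	_, ok := toStringChecked(good)
	fmt.Println(ok) // true
	_, ok = toStringChecked(bad)
	fmt.Println(ok) // false

	// utf8.ValidString runs the same check on a value that is already a string.
	fmt.Println(utf8.ValidString(string(bad))) // false
}
```

Checking before or after the conversion gives the same answer, since the conversion copies the bytes unchanged.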
But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8, which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
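To make that concrete, here is a small sketch that pushes an invalid byte through a string and back. The roundTrip helper is hypothetical; bytes.Buffer stands in for any type with a WriteString method.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip is a hypothetical helper: []byte -> string -> []byte.
func roundTrip(b []byte) []byte {
	s := string(b)
	return []byte(s)
}

func main() {
	in := []byte{'a', 0xff, 'b'}
	s := string(in)

	fmt.Println(len(s))       // 3: len counts bytes, not runes
	fmt.Printf("%#x\n", s[1]) // 0xff: indexing yields the raw byte

	var buf bytes.Buffer
	buf.WriteString(s) // the invalid byte is written through unchanged
	fmt.Println(bytes.Equal(buf.Bytes(), in))   // true
	fmt.Println(bytes.Equal(roundTrip(in), in)) // true
}
```

Nothing here decodes UTF-8, so nothing gets replaced: the bytes that went in are the bytes that come out.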
There are two places in the language where Go does do UTF-8 decoding of strings for you:
1. In for i, r := range s, the r is a Unicode code point as a value of type rune.
2. In the conversion []rune(s), Go decodes the whole string to runes.
(Note that rune is an alias for int32, not a completely different type.)
In both these instances, invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never panic, so you only need to actively check for UTF-8 validity if it's relevant to your application, like if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.
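If you do need to reject mis-encoded input rather than accept the replacement, one way to sketch it is to validate first and only then decode. The names decodeStrict and errInvalidUTF8 are invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// decodeStrict (hypothetical) returns the runes of b, or an error instead of
// letting the conversion silently substitute U+FFFD for malformed bytes.
func decodeStrict(b []byte) ([]rune, error) {
	if !utf8.Valid(b) {
		return nil, errInvalidUTF8
	}
	return []rune(string(b)), nil
}

func main() {
	if _, err := decodeStrict([]byte{0xff}); err != nil {
		fmt.Println("rejected:", err)
	}

	rs, err := decodeStrict([]byte("héllo"))
	fmt.Println(len(rs), err) // 5 <nil>
}
```

Input that genuinely contains an encoded U+FFFD is still valid UTF-8, so this sketch accepts it; only malformed bytes are rejected.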
Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is available as the constant utf8.RuneError, and functions in utf8 such as utf8.DecodeRune return it when they hit an invalid encoding.
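One use of that: utf8.DecodeRune reports an invalid encoding as utf8.RuneError with a size of 1, whereas a correctly encoded U+FFFD decodes with a size of 3. Assuming you want the position of the bad byte, a hypothetical firstInvalid helper could look like this:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid (hypothetical) returns the byte offset of the first malformed
// sequence in b, or -1 if b is entirely valid UTF-8.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// RuneError with size 1 means a decoding failure; a real U+FFFD
		// in the input decodes with size 3 and is not an error.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid([]byte("ok")))                // -1
	fmt.Println(firstInvalid([]byte{'a', 'b', 0xff, 'c'})) // 2
	fmt.Println(firstInvalid([]byte("\uFFFD")))            // -1
}
```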
Here's a sample program showing what Go does with a []byte holding invalid UTF-8:
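    package main

    import "fmt"

    func main() {
        a := []byte{0xff} // a single byte that is never valid UTF-8
        s := string(a)    // the conversion keeps the byte as-is
        fmt.Println(s)

        // Ranging over the string decodes it; the bad byte becomes U+FFFD.
        for _, r := range s {
            fmt.Println(r)
        }

        // Converting to []rune decodes the whole string the same way.
        rs := []rune(s)
        fmt.Println(rs)
    }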
Output will look different in different environments, but in the Playground it looks like:

    �
    65533
    [65533]