How to detect when bytes can't be converted to string in Go?

Question

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?

twotwotwo · Accepted Answer

You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
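As a minimal sketch of that check: utf8.Valid takes a []byte and utf8.ValidString takes a string, both from the standard unicode/utf8 package. The helper name isValidUTF8 is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 reports whether b consists entirely of valid UTF-8.
// Hypothetical helper name; it only wraps utf8.Valid.
func isValidUTF8(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	good := []byte("héllo")
	bad := []byte{'h', 0xff, 'i'} // 0xff never appears in valid UTF-8

	fmt.Println(isValidUTF8(good)) // true
	fmt.Println(isValidUTF8(bad))  // false

	// utf8.ValidString performs the same check on a string.
	fmt.Println(utf8.ValidString(string(bad))) // false
}
```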

But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8 which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
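To make that concrete, here is a small sketch showing invalid bytes surviving a trip through a string unchanged. The roundTrip helper is a hypothetical name used only for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip converts b to a string and back to a []byte.
// Hypothetical helper; neither conversion validates or alters the bytes.
func roundTrip(b []byte) []byte {
	s := string(b)
	return []byte(s)
}

func main() {
	orig := []byte{'a', 0xff, 0xfe, 'b'} // 0xff and 0xfe are not valid UTF-8
	s := string(orig)

	fmt.Println(len(s))       // 4: len counts bytes, not characters
	fmt.Printf("%#x\n", s[1]) // 0xff: indexing yields the raw byte
	fmt.Printf("%q\n", s)     // "a\xff\xfeb": %q escapes the invalid bytes

	fmt.Println(bytes.Equal(orig, roundTrip(orig))) // true
}
```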

There are two places in the language where Go does UTF-8 decoding of strings for you:

When you do for i, r := range s, the r is a Unicode code point as a value of type rune.
When you do the conversion []rune(s), Go decodes the whole string to runes.

(Note that rune is an alias for int32, not a completely different type.)

In both cases, invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, for example if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.
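If you do need to reject mis-encoded input, one possible approach is to validate before converting. This is only a sketch: toStringStrict and errInvalidUTF8 are hypothetical names, not part of any library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this example.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// toStringStrict converts b to a string, but returns an error instead
// of letting invalid bytes through. Hypothetical helper.
func toStringStrict(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	if _, err := toStringStrict([]byte{0xff}); err != nil {
		fmt.Println("rejected:", err)
	}

	s, err := toStringStrict([]byte("ok"))
	fmt.Println(s, err)
}
```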

Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is the constant utf8.RuneError, and it's what decoding functions in unicode/utf8, such as utf8.DecodeRune, return for invalid input.
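One wrinkle: a string can legitimately contain an encoded U+FFFD. A rough sketch of telling the two cases apart uses the size that utf8.DecodeRuneInString returns, which is 1 for an invalid byte and 3 for a real U+FFFD. The firstInvalid helper is a made-up name for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte offset of the first invalid UTF-8 byte
// in s, or -1 if s is valid. Hypothetical helper.
func firstInvalid(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// A decoding error is reported as (utf8.RuneError, 1).
		// A genuine U+FFFD in the input decodes with size 3.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid("ab\xffcd"))  // 2
	fmt.Println(firstInvalid("a\uFFFDb")) // -1: a real U+FFFD is valid UTF-8
}
```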

Here's a sample program showing what Go does with a []byte holding invalid UTF-8:

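package main

import "fmt"

func main() {
    a := []byte{0xff} // 0xff is never valid in UTF-8
    s := string(a)    // the conversion keeps the byte as-is; nothing is checked
    fmt.Println(s)    // writes the raw byte 0xff
    for _, r := range s {
        fmt.Println(r) // range decodes; the invalid byte becomes U+FFFD (65533)
    }
    rs := []rune(s) // []rune(s) decodes the same way
    fmt.Println(rs)
}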

Output will look different in different environments, but in the Playground it looks like this:

�
65533
[65533]

Tags: string, encoding, unicode, utf-8, go, codefx
