
How to detect when bytes can't be converted to string in Go?

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?

codefx asked Jan 18 '16 18:01


1 Answer

You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
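As a minimal sketch of that check, the program below validates a []byte with utf8.Valid before converting it. The helper toValidString and the error value errInvalidUTF8 are hypothetical names made up for this illustration; only utf8.Valid comes from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// toValidString is a hypothetical helper: it converts b to a string
// only if b is entirely valid UTF-8, and returns an error otherwise.
func toValidString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{
		[]byte("héllo"), // valid UTF-8
		{0xff, 0xfe},    // bytes that never occur in valid UTF-8
	}
	for _, b := range inputs {
		s, err := toValidString(b)
		fmt.Printf("%q %v\n", s, err)
	}
}
```

If the data is already a string, utf8.ValidString performs the same check without a conversion.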

But you might be assuming that converting non-UTF-8 bytes to a Go string is impossible. It isn't: "In Go, a string is in effect a read-only slice of bytes". A string can contain bytes that aren't valid UTF-8, and you can print them, access them via indexing, pass them to WriteString methods, or even round-trip them back to a []byte (to Write, say).
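To make the round-trip point concrete, here is a small sketch. The helper roundTrip is a hypothetical name used only for this example; it converts bytes to a string and back so you can confirm nothing is altered along the way.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip is a hypothetical helper: it converts b to a string and
// back to a []byte. Neither conversion validates or rewrites the bytes.
func roundTrip(b []byte) []byte {
	s := string(b)
	return []byte(s)
}

func main() {
	in := []byte{'a', 0xff, 'b'} // 0xff is not valid UTF-8
	s := string(in)

	fmt.Println(len(s))                         // length in bytes: 3
	fmt.Printf("%#x\n", s[1])                   // indexing yields the raw byte: 0xff
	fmt.Println(bytes.Equal(in, roundTrip(in))) // true
}
```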

There are two places in the language where Go does UTF-8 decoding of strings for you:

  • when you do for i, r := range s the r is a Unicode code point as a value of type rune
  • when you do the conversion []rune(s), Go decodes the whole string to runes.

(Note that rune is an alias for int32, not a completely different type.)

In both cases, invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, for example if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.

Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is available as the constant utf8.RuneError, and the decoding functions in unicode/utf8 return it for invalid input.
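One caveat when checking for utf8.RuneError: a string can legitimately contain an encoded U+FFFD. A sketch of one way to tell the two apart uses utf8.DecodeRuneInString, which returns (RuneError, 1) for an invalid encoding, whereas a real U+FFFD decodes with a width of 3 bytes. The function name firstInvalid is hypothetical, chosen for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid is a hypothetical helper: it returns the byte offset of
// the first invalid UTF-8 sequence in s, or -1 if s is valid UTF-8.
func firstInvalid(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// A width of 1 with RuneError signals an invalid encoding;
		// a genuine U+FFFD in the input has a width of 3.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid("ab\xffcd"))  // 2
	fmt.Println(firstInvalid("a\uFFFDb")) // -1: a real U+FFFD is valid
	fmt.Println(firstInvalid("héllo"))    // -1
}
```

This is mainly useful when you want to report where the bad input is; for a plain yes/no answer, utf8.Valid or utf8.ValidString is simpler.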

Here's a sample program showing what Go does with a []byte holding invalid UTF-8:

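package main

import "fmt"

func main() {
    // 0xff can never appear in valid UTF-8.
    a := []byte{0xff}
    // The conversion keeps the bytes as-is; it does not validate or fail.
    s := string(a)
    fmt.Println(s)
    // range decodes UTF-8; the invalid byte comes out as U+FFFD (65533).
    for _, r := range s {
        fmt.Println(r)
    }
    // []rune(s) decodes the same way.
    rs := []rune(s)
    fmt.Println(rs)
}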

Output will look different in different environments, but in the Playground it looks like this:

�
65533
[65533]
twotwotwo answered Sep 21 '22 08:09