How to encode []rune into []byte using UTF-8

It's really easy to decode a []byte into a []rune: converting to string and then to []rune works very nicely (I'm assuming it defaults to UTF-8, with filler for invalid bytes). My question is: how are you supposed to encode this []rune back to []byte in UTF-8 form?

Am I missing something, or do I have to manually call EncodeRune for every single rune in my []rune? Surely there is an encoder that I can simply pass a Writer to.

598

asked Mar 25 '15 12:03

dpington

1 Answer

You can simply convert a rune slice ([]rune) to string, which you can then convert back to []byte.

Example:

rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
bs := []byte(string(rs))

fmt.Printf("%s\n", bs)
fmt.Println(string(bs))

Output (try it on the Go Playground):

Hello 世界
Hello 世界

The Go Specification: Conversions mentions this case explicitly: Conversions to and from a string type, point #3:

Converting a slice of runes to a string type yields a string that is the concatenation of the individual rune values converted to strings.

Note that the above solution, although it may be the simplest, might not be the most efficient. It first creates a string value that holds a copy of the runes in UTF-8 encoded form, then copies the string's backing bytes to the result byte slice. The second copy has to be made because string values are immutable: if the result slice shared data with the string, we would be able to modify the content of the string. For details, see golang: []byte(string) vs []byte(*string) and Immutable string and pointer address.

Note that a smart compiler could detect that the intermediate string value cannot be referred to and thus eliminate one of the copies.

We may get better performance by allocating a single byte slice and encoding the runes into it one by one. The unicode/utf8 package helps with this:

rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
bs := make([]byte, len(rs)*utf8.UTFMax)

count := 0
for _, r := range rs {
    count += utf8.EncodeRune(bs[count:], r)
}
bs = bs[:count]

fmt.Printf("%s\n", bs)
fmt.Println(string(bs))

Output of the above is the same. Try it on the Go Playground.

Note that in order to create the result slice, we had to guess how big it would be. We used an upper bound: the number of runes multiplied by the maximum number of bytes a rune may be encoded to (utf8.UTFMax). In most cases, this will be bigger than needed.

We may create a third version where we first calculate the exact size needed, using the utf8.RuneLen() function. The gain is that we do not "waste" memory, and we do not need the final slicing (bs = bs[:count]).

Let's compare the performance of the three versions:
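Another way to avoid over-allocating, not part of the original answer, is to let append grow the slice on demand. This sketch assumes Go 1.18 or later, where utf8.AppendRune is available; the function name runesToUTF8Append and the initial capacity of one byte per rune are choices made for this example, and it has not been benchmarked against the other versions.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesToUTF8Append (hypothetical name) encodes rs into a byte slice that
// grows as needed. Requires Go 1.18+ for utf8.AppendRune.
func runesToUTF8Append(rs []rune) []byte {
	// One byte per rune is exact for pure ASCII input; for anything
	// wider, append reallocates as required.
	bs := make([]byte, 0, len(rs))
	for _, r := range rs {
		bs = utf8.AppendRune(bs, r)
	}
	return bs
}

func main() {
	rs := []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}
	fmt.Println(string(runesToUTF8Append(rs))) // Hello 世界
}
```

The trade-off is possible reallocation for non-ASCII input in exchange for not reserving utf8.UTFMax bytes per rune up front.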

// runesToUTF8 converts via an intermediate string (simplest version).
func runesToUTF8(rs []rune) []byte {
    return []byte(string(rs))
}

// runesToUTF8Manual allocates the worst-case size up front, then trims.
func runesToUTF8Manual(rs []rune) []byte {
    bs := make([]byte, len(rs)*utf8.UTFMax)

    count := 0
    for _, r := range rs {
        count += utf8.EncodeRune(bs[count:], r)
    }

    return bs[:count]
}

// runesToUTF8Manual2 computes the exact size first, then encodes.
func runesToUTF8Manual2(rs []rune) []byte {
    size := 0
    for _, r := range rs {
        size += utf8.RuneLen(r)
    }

    bs := make([]byte, size)

    count := 0
    for _, r := range rs {
        count += utf8.EncodeRune(bs[count:], r)
    }

    return bs
}

And the benchmarking code:

var rs = []rune{'H', 'e', 'l', 'l', 'o', ' ', '世', '界'}

func BenchmarkFirst(b *testing.B) {
    for i := 0; i < b.N; i++ {
        runesToUTF8(rs)
    }
}

func BenchmarkSecond(b *testing.B) {
    for i := 0; i < b.N; i++ {
        runesToUTF8Manual(rs)
    }
}

func BenchmarkThird(b *testing.B) {
    for i := 0; i < b.N; i++ {
        runesToUTF8Manual2(rs)
    }
}

And the results:

BenchmarkFirst-4        20000000                95.8 ns/op
BenchmarkSecond-4       20000000                84.4 ns/op
BenchmarkThird-4        20000000                81.2 ns/op

As suspected, the second version is faster and the third version is the fastest, although the performance gain is not huge. In general the first, simplest solution is preferred, but if this is in a critical part of your app (and is executed many, many times), the third version might be worth using.

118

answered Oct 19 '22 08:10

icza
