Rune vs byte ranging over string

Tags:

go

rune

According to https://blog.golang.org/strings and my own testing, it looks like when we range over a string, the characters we get are of type rune, but if we get them by str[index], they are of type byte. Why is that?

534

asked Oct 31 '19 00:10

user8142520

2 Answers

At the first level, the reason is that this is how the language is defined. The specification's section on string types tells us that:

A string value is a (possibly empty) sequence of bytes. The number of bytes is called the length of the string and is never negative. Strings are immutable: once created, it is impossible to change the contents of a string.

and:

A string's bytes can be accessed by integer indices 0 through len(s)-1.

Meanwhile, range is a clause you can insert into a for statement, and the specification says:

The expression on the right in the "range" clause is called the range expression, which may be ... [a] string ...

and:

For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
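To see the two quoted rules side by side, here is a small sketch. The helper rangeSteps and the type runeStep are names made up for this illustration; they are not part of the language or the standard library. It indexes a string (bytes) and ranges over it (byte offsets plus runes), including the invalid-UTF-8 case from the quote above.

```go
package main

import "fmt"

// runeStep records one iteration of a range loop over a string.
// (Hypothetical helper type, for illustration only.)
type runeStep struct {
	Index int  // byte offset where the code point starts
	Value rune // decoded code point
}

// rangeSteps collects what "for i, r := range s" yields.
func rangeSteps(s string) []runeStep {
	var steps []runeStep
	for i, r := range s {
		steps = append(steps, runeStep{i, r})
	}
	return steps
}

func main() {
	s := "h\u00e9llo" // U+00E9 (é) is encoded in UTF-8 as the two bytes 0xC3 0xA9

	fmt.Println(len(s))                // 6: len counts bytes, not code points
	fmt.Printf("%T %#x\n", s[1], s[1]) // uint8 0xc3: indexing yields a byte

	// range yields byte offsets 0, 1, 3, 4, 5: offset 2 is skipped
	// because it is the second byte of é.
	for _, st := range rangeSteps(s) {
		fmt.Printf("%d %c\n", st.Index, st.Value)
	}

	// An invalid byte yields U+FFFD and the loop advances by one byte.
	for _, st := range rangeSteps("a\xffb") {
		fmt.Printf("%d %U\n", st.Index, st.Value)
	}
}
```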

If you want to know why the language is defined that way, you really have to ask the definers themselves. However, note that if for ranged only over the bytes, you'd need to construct your own fancier loops to range over the runes. Given that for ... range does work through the runes, if you want to work through the bytes in string s instead, you can write:

for i := 0; i < len(s); i++ {
    ...
}

and easily access s[i] inside the loop. You can also write:

for i, b := range []byte(s) {
}

and access both index i and byte b inside the loop. (Conversion from string to []byte, or vice versa, can require a copy since []byte can be modified. In this case, though, the range does not modify it and the compiler can optimize away the copy. See icza's comment below or this answer to golang: []byte(string) vs []byte(*string).) So you have not lost any ability, just perhaps a smidgen of concision.
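As a quick check that the two byte loops above visit the same bytes, here is a minimal sketch. The function names bytesByIndex and bytesByRange are invented for this example.

```go
package main

import "fmt"

// bytesByIndex collects the bytes of s using a plain index loop.
func bytesByIndex(s string) []byte {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, s[i]) // s[i] has type byte
	}
	return out
}

// bytesByRange collects the bytes of s by ranging over []byte(s).
func bytesByRange(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, b := range []byte(s) {
		out = append(out, b) // b has type byte, not rune
	}
	return out
}

func main() {
	s := "h\u00e9" // 'h' followed by é (two bytes in UTF-8)
	fmt.Printf("% x\n", bytesByIndex(s)) // 68 c3 a9
	fmt.Printf("% x\n", bytesByRange(s)) // 68 c3 a9
}
```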

197

answered Oct 11 '22 15:10

torek

Just a quick and simple answer on why the language is defined this way.

Think about what a rune is. A rune represents a Unicode code point, which may be encoded as multiple bytes, and its byte representation depends on the encoding.

Now think about what mystring[i] would mean if it returned a rune and not a byte. Since you cannot know the length of each rune without scanning the string, that operation would have to scan the string from the beginning on every access, making array-like access take O(n) instead of O(1).

It would be very counter-intuitive for the users of the language if mystring[i] scanned the whole string every time, and also more complex for the language developers. This is why most programming languages (like Go, Rust, Python) differentiate between Unicode characters and bytes, and sometimes only support indexing on bytes.
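To make the cost concrete, here is a rough sketch of what rune-based indexing would have to do. runeAt is a hypothetical helper written for this answer, not something Go provides: it has no choice but to walk the string from the start, whereas s[i] is a direct byte lookup.

```go
package main

import "fmt"

// runeAt returns the n-th code point (0-based) of s and true,
// or 0 and false if s has fewer than n+1 code points.
// It must decode from the start of s, so it is O(len(s)) in the worst case.
func runeAt(s string, n int) (rune, bool) {
	if n < 0 {
		return 0, false
	}
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	s := "h\u00e9llo" // é occupies byte offsets 1 and 2

	r, ok := runeAt(s, 1)
	fmt.Printf("%c %v\n", r, ok) // é true

	// Byte indexing is constant time but gives only the first byte of é.
	fmt.Printf("%#x\n", s[1]) // 0xc3
}
```

If you need random access to code points many times, converting once with []rune(s) pays the O(n) scan a single time, at the price of a copy.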

Accessing a string one rune at a time is instead much simpler when iterating from the beginning, for example with range. Consecutive bytes can be scanned and grouped together until they form a valid Unicode character, which is returned as a rune before moving on to the next one.
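That sequential grouping can be written out by hand with utf8.DecodeRuneInString from the standard unicode/utf8 package. The sketch below is only an approximation of what range does for you; decodeAll is a name made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll walks s from the start and returns its code points,
// advancing by the encoded size of each one.
func decodeAll(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		// DecodeRuneInString returns the rune and how many bytes it used.
		// For invalid UTF-8 it returns utf8.RuneError (U+FFFD) and size 1.
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += size
	}
	return out
}

func main() {
	for _, r := range decodeAll("h\u00e9llo") {
		fmt.Printf("%c ", r)
	}
	fmt.Println()
}
```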

answered Oct 11 '22 14:10

Marco Bonelli
