I am trying to efficiently count runes from a utf-8 string
using the utf8 library. Is this example optimal in that it does not copy the underlying data?
https://golang.org/pkg/unicode/utf8/#example_DecodeRuneInString
func main() {
str := "Hello, 世界" // let's assume a runtime-provided string
for len(str) > 0 {
r, size := utf8.DecodeRuneInString(str)
fmt.Printf("%c %v\n", r, size)
str = str[size:] // performs copy?
}
}
I found StringHeader in the (unsafe) reflect library. Is this the exact structure of a string
in Go? If so, it is conceivable that slicing a string merely updates Data
or allocates a new StringHeader
altogether.
type StringHeader struct {
Data uintptr
Len int
}
Bonus: where can I find the code that performs string
slicing so that I could look it up myself? Any of these?
https://golang.org/src/runtime/slice.go
https://golang.org/src/runtime/string.go
This related SO answer suggests that runtime-strings incur a copy when converted from string
to []byte
.
does slice of string perform copy of underlying data?
No it does not. See this post by Russ Cox:
A string is represented in memory as a 2-word structure containing a pointer to the string data and a length. Because the string is immutable, it is safe for multiple strings to share the same storage, so slicing s results in a new 2-word structure with a potentially different pointer and length that still refers to the same byte sequence. This means that slicing can be done without allocation or copying, making string slices as efficient as passing around explicit indexes.
-- Go Data Structures
A slice is basically three things: a length, a capacity, and a pointer to a location in an underlying array.
As such, slices themselves are not very large: ints and a pointer (possibly some other small things in implementation detail). So the allocation required to make a copy of a slice is very small, and doesn't depend on the size of the underlying array. And no new allocation is required when you simply update the length, capacity, and pointer location, such as on line 2 of:
foo := []int{3, 4, 5, 6}
foo = foo[1:]
Rather, it's when a new underlying array has to be allocated that a performance impact is felt.
Strings in Go are immutable. So to change a string you need to make a new string. However, strings are closely related to byte slices, e.g. you can create a byte slice from a string with
foo := `here's my string`
fooBytes := []byte(foo)
I believe that will allocate a new array of bytes, because:
a string is in effect a read-only slice of bytes
according to the Go Blog (see Strings, bytes, runes and characters in Go). In general you can use a slice to change the contents of an underlying array, so to produce a usable byte slice from a string you would have to make a copy to keep the user from changing what's supposed to be immutable.
You could use performance profiling and benchmarking to gain further insight into the performance of your program.
Once you have your slice of bytes, fooBytes
, reslicing it does not allocate a new array, it just allocates a new slice, which is small. This appears to be what slicing a string does as well.
Note that you don't need to use the utf8
package to count words in a utf8 string, though you may proceed that way if you like. Go handles utf8 natively. However if you want to iterate over characters you can't represent the string as a slice of bytes, because you could have multibyte characters. Instead you need to represent it as a slice of runes:
foo := `here's my string`
fooRunes := []rune(foo)
This operation of converting a string to a slice of runes is fast in my experience (trivial in benchmarks I've done, but there may be an allocation). Now you can iterate across fooRunes
to count words, no utf8
package required. Alternatively, you can skip the explicit []rune(foo)
conversion and do it implicitly by using a for ... range
loop on the string, because those are special:
A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.
-- Strings, bytes, runes and characters in Go
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With