Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does slice of string perform copy of underlying data?

I am trying to efficiently count runes from a utf-8 string using the utf8 library. Is this example optimal in that it does not copy the underlying data?
https://golang.org/pkg/unicode/utf8/#example_DecodeRuneInString

func main() {
    str := "Hello, 世界" // let's assume a runtime-provided string
    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%c %v\n", r, size)
        str = str[size:] // performs copy?
    }
}

I found StringHeader in the (unsafe) reflect library. Is this the exact structure of a string in Go? If so, it is conceivable that slicing a string merely updates Data or allocates a new StringHeader altogether.

type StringHeader struct {
        Data uintptr
        Len  int
}

Bonus: where can I find the code that performs string slicing so that I could look it up myself? Any of these?
https://golang.org/src/runtime/slice.go
https://golang.org/src/runtime/string.go

This related SO answer suggests that runtime-strings incur a copy when converted from string to []byte.

like image 783
Razzle Shazl Avatar asked Sep 18 '18 23:09

Razzle Shazl


1 Answers

Slicing Strings

does slice of string perform copy of underlying data?

No it does not. See this post by Russ Cox:

A string is represented in memory as a 2-word structure containing a pointer to the string data and a length. Because the string is immutable, it is safe for multiple strings to share the same storage, so slicing s results in a new 2-word structure with a potentially different pointer and length that still refers to the same byte sequence. This means that slicing can be done without allocation or copying, making string slices as efficient as passing around explicit indexes.

-- Go Data Structures

Slices, Performance, and Iterating Over Runes

A slice is basically three things: a length, a capacity, and a pointer to a location in an underlying array.

As such, slices themselves are not very large: ints and a pointer (possibly some other small things in implementation detail). So the allocation required to make a copy of a slice is very small, and doesn't depend on the size of the underlying array. And no new allocation is required when you simply update the length, capacity, and pointer location, such as on line 2 of:

foo := []int{3, 4, 5, 6}
foo = foo[1:]

Rather, it's when a new underlying array has to be allocated that a performance impact is felt.

Strings in Go are immutable. So to change a string you need to make a new string. However, strings are closely related to byte slices, e.g. you can create a byte slice from a string with

foo := `here's my string`
fooBytes := []byte(foo)

I believe that will allocate a new array of bytes, because:

a string is in effect a read-only slice of bytes

according to the Go Blog (see Strings, bytes, runes and characters in Go). In general you can use a slice to change the contents of an underlying array, so to produce a usable byte slice from a string you would have to make a copy to keep the user from changing what's supposed to be immutable.

You could use performance profiling and benchmarking to gain further insight into the performance of your program.

Once you have your slice of bytes, fooBytes, reslicing it does not allocate a new array, it just allocates a new slice, which is small. This appears to be what slicing a string does as well.

Note that you don't need to use the utf8 package to count words in a utf8 string, though you may proceed that way if you like. Go handles utf8 natively. However if you want to iterate over characters you can't represent the string as a slice of bytes, because you could have multibyte characters. Instead you need to represent it as a slice of runes:

foo := `here's my string`
fooRunes := []rune(foo)

This operation of converting a string to a slice of runes is fast in my experience (trivial in benchmarks I've done, but there may be an allocation). Now you can iterate across fooRunes to count words, no utf8 package required. Alternatively, you can skip the explicit []rune(foo) conversion and do it implicitly by using a for ... range loop on the string, because those are special:

A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.

-- Strings, bytes, runes and characters in Go

like image 80
jrefior Avatar answered Oct 04 '22 22:10

jrefior