I have a large string in Go and I'd like to split it up into smaller chunks. Each chunk should be at most 10kb. The chunks should be split on runes (not in the middle of a rune).
What is the idiomatic way to do this in Go? Should I just loop over the bytes of the string? Am I missing some helpful stdlib packages?
Use utf8.RuneStart to scan backward from the size limit for a rune boundary, then slice the string at that boundary.
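var chunks []string
for len(s) > 10000 {
	i := 10000
	// Back up to the start of a rune, but by no more than utf8.UTFMax bytes.
	for i >= 10000-utf8.UTFMax && !utf8.RuneStart(s[i]) {
		i--
	}
	chunks = append(chunks, s[:i])
	s = s[i:]
}
// Whatever remains fits in a single chunk.
if len(s) > 0 {
	chunks = append(chunks, s)
}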
With this approach, the application examines only a few bytes at each chunk boundary instead of the entire string.
The backward scan gives up after utf8.UTFMax bytes, which guarantees progress even when the string is not valid UTF-8. You might want to treat that situation as an error or split the string in a different way.
playground example
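If you prefer to treat invalid input as an error, here is a minimal sketch of that variant. The name chunkValid and its limit parameter are made up for illustration, and it assumes the limit is at least utf8.UTFMax bytes. Because the string is validated first with utf8.ValidString, the backward scan needs no lower bound.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// chunkValid is a hypothetical helper: it rejects invalid UTF-8 up front,
// then splits s into chunks of at most limit bytes on rune boundaries.
func chunkValid(s string, limit int) ([]string, error) {
	if limit < utf8.UTFMax {
		return nil, errors.New("limit must be at least utf8.UTFMax bytes")
	}
	if !utf8.ValidString(s) {
		return nil, errors.New("input is not valid UTF-8")
	}
	var chunks []string
	for len(s) > limit {
		i := limit
		// In valid UTF-8 the nearest rune start is at most
		// utf8.UTFMax-1 bytes back, so this loop always terminates.
		for !utf8.RuneStart(s[i]) {
			i--
		}
		chunks = append(chunks, s[:i])
		s = s[i:]
	}
	if len(s) > 0 {
		chunks = append(chunks, s)
	}
	return chunks, nil
}

func main() {
	chunks, err := chunkValid("h\u00e9llo, \u4e16\u754c", 4)
	fmt.Printf("%q %v\n", chunks, err)

	_, err = chunkValid("bad\xff", 4)
	fmt.Println(err)
}
```

Validating first costs one pass over the whole string, so this gives up the "only look at the boundaries" advantage in exchange for a clear error.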
The idiomatic way to split a string (or any slice or array) is by slicing. Since you want to split by rune, you have to loop through the entire string, because you don't know in advance how many bytes each slice will contain.
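slices := []string{}
count := 0     // runes seen so far
lastIndex := 0 // byte offset where the current slice starts
for i := range longString {
	// i is the byte offset of the current rune; cut after every 10000 runes.
	if count > 0 && count%10000 == 0 {
		slices = append(slices, longString[lastIndex:i])
		lastIndex = i
	}
	count++
}
// Append whatever is left after the last cut.
if lastIndex < len(longString) {
	slices = append(slices, longString[lastIndex:])
}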
Warning: I have not run or tested this code, but it conveys the general principles. Looping over a string with range loops over the runes and not the bytes, automatically decoding the UTF-8 for you. Using the slice operator [] represents your new strings as subslices of longString, which means that no bytes from the string need to be copied.
Note that i is the byte index in the string and may be incremented by more than 1 in each loop iteration.
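To make the point about byte indexes concrete, here is a small sketch. The helper name runeOffsets is made up for illustration. It collects the index that range reports for each rune of a string mixing 1-, 2- and 3-byte characters.

```go
package main

import "fmt"

// runeOffsets returns the byte index at which each rune of s starts,
// exactly as reported by a range loop over the string.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// 'a' is 1 byte, U+00E9 is 2 bytes, U+4E16 is 3 bytes, '!' is 1 byte.
	s := "a\u00e9\u4e16!"
	fmt.Println(len(s), runeOffsets(s)) // prints: 7 [0 1 3 6]
}
```

The indexes jump by the encoded width of the previous rune, which is why slicing at any i from a range loop never lands in the middle of a rune.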
EDIT:
Sorry, I didn't see you wanted to limit the number of bytes, not Unicode code points. You can implement that as well relatively easily.
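slices := []string{}
lastIndex := 0 // byte offset where the current slice starts
lastI := 0     // byte offset of the previous rune
for i := range longString {
	// longString[lastIndex:i] would exceed 10000 bytes, so cut before the previous rune.
	if i-lastIndex > 10000 {
		slices = append(slices, longString[lastIndex:lastI])
		lastIndex = lastI
	}
	lastI = i
}
// Bytes from lastIndex onward are left over and still need to be handled after the loop.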
A working example at play.golang.org, which also takes care of the leftover bytes at the end of the string.
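The contents of that playground link are not reproduced here, so the following is only a sketch of how the leftover handling might look, not the linked code. The function name splitByBytes and its limit parameter are made up for illustration, and it assumes the limit is at least 4 bytes (utf8.UTFMax) so that a single rune always fits in a piece.

```go
package main

import "fmt"

// splitByBytes is a hypothetical helper: it cuts s into pieces of at most
// limit bytes, cutting only at the rune starts reported by range.
// Assumption: limit >= 4, so any single rune fits in one piece.
func splitByBytes(s string, limit int) []string {
	var pieces []string
	start := 0 // byte offset where the current piece begins
	prev := 0  // byte offset of the previous rune
	for i := range s {
		// s[start:i] would be too long, so cut before the previous rune.
		if i-start > limit && prev > start {
			pieces = append(pieces, s[start:prev])
			start = prev
		}
		prev = i
	}
	// The final rune can push the tail past the limit; cut once more if so.
	if len(s)-start > limit && prev > start {
		pieces = append(pieces, s[start:prev])
		start = prev
	}
	// Append the leftover bytes at the end of the string.
	if start < len(s) {
		pieces = append(pieces, s[start:])
	}
	return pieces
}

func main() {
	pieces := splitByBytes("h\u00e9llo, \u4e16\u754c", 4)
	fmt.Printf("%q\n", pieces)
}
```

The extra check after the loop matters because the loop only ever sees where the last rune starts, not where it ends.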