
Split a String into 10kb chunks in Go

Tags:

go

I have a large string in Go and I'd like to split it up into smaller chunks. Each chunk should be at most 10kb. The chunks should be split on runes (not in the middle of a rune).

What is the idiomatic way to do this in Go? Should I just loop over the bytes of the string? Am I missing some helpful stdlib packages?

aloo asked Jul 20 '15 08:07


2 Answers

Use utf8.RuneStart to scan backward for a rune boundary, then slice the string at that boundary.

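// Requires: import "unicode/utf8"
var chunks []string
for len(s) > 10000 {
    i := 10000
    // Back up to the nearest rune start, giving up after a few bytes
    // so that invalid UTF-8 cannot stall the loop.
    for i >= 10000 - utf8.UTFMax && !utf8.RuneStart(s[i]) {
        i--
    }
    chunks = append(chunks, s[:i])
    s = s[i:]
}
// Whatever remains fits in a single chunk.
if len(s) > 0 {
    chunks = append(chunks, s)
}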

Using this approach, the application examines only a few bytes at each chunk boundary instead of the entire string.

The code is written to guarantee progress when the string is not a valid UTF-8 encoding. You might want to handle this situation as an error or split the string in a different way.
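If you would rather treat invalid UTF-8 as an error, one possible sketch is below. The names splitAtRuneBoundaries and errInvalidUTF8 are hypothetical, and the limit is assumed to be at least utf8.UTFMax bytes. Note that utf8.ValidString scans the whole string once, so this variant gives up the "only a few bytes per boundary" property in exchange for the check.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error used only in this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// splitAtRuneBoundaries is a hypothetical helper. It cuts s into chunks of at
// most limit bytes without splitting a rune, and rejects invalid UTF-8.
// Assumption: limit must be at least utf8.UTFMax so every chunk can hold a rune.
func splitAtRuneBoundaries(s string, limit int) ([]string, error) {
	if limit < utf8.UTFMax {
		return nil, errors.New("limit must be at least utf8.UTFMax")
	}
	if !utf8.ValidString(s) {
		return nil, errInvalidUTF8
	}
	var chunks []string
	for len(s) > limit {
		i := limit
		// In valid UTF-8 a rune start is at most UTFMax-1 bytes back,
		// so this loop always stops at an index greater than zero.
		for !utf8.RuneStart(s[i]) {
			i--
		}
		chunks = append(chunks, s[:i])
		s = s[i:]
	}
	if len(s) > 0 {
		chunks = append(chunks, s)
	}
	return chunks, nil
}

func main() {
	chunks, err := splitAtRuneBoundaries("héllo, 世界", 4)
	fmt.Printf("%q %v\n", chunks, err)
}
```

For the 10kb case in the question you would call it with a limit of 10000.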

playground example

Bayta Darell answered Oct 02 '22 21:10


The idiomatic way to split a string (or any slice or array) is by slicing. Because you want to split by rune, you'd have to loop through the entire string: you don't know in advance how many bytes each slice will contain.

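slices := []string{}
count := 0     // runes in the current chunk so far
lastIndex := 0 // byte offset where the current chunk starts
for i := range longString {
    if count == 10000 {
        slices = append(slices, longString[lastIndex:i])
        lastIndex = i
        count = 0
    }
    count++
}
// Append whatever is left after the last full chunk.
if lastIndex < len(longString) {
    slices = append(slices, longString[lastIndex:])
}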

Warning: I have not run or tested this code, but it conveys the general principles. Looping over a string with range iterates over the runes, not the bytes, automatically decoding the UTF-8 for you. Using the slice operator [] represents your new strings as substrings of longString, which means that no bytes from the string need to be copied.

Note that i is the byte index in the string and may be incremented by more than 1 in each loop iteration.
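A small illustration of that byte index behaviour: the sketch below collects the index that range yields for each rune. The helper name runeOffsets is made up for this example.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that returns the byte offset at which
// each rune of s starts, exactly as reported by a range loop.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// In UTF-8, 'a' takes 1 byte, 'é' takes 2 bytes and '世' takes 3 bytes.
	fmt.Println(runeOffsets("aé世")) // prints [0 1 3]
}
```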

EDIT:

Sorry, I didn't see you wanted to limit the number of bytes, not Unicode code points. You can implement that as well relatively easily.

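slices := []string{}
lastIndex := 0 // byte offset where the current chunk starts
lastI := 0     // byte offset of the previous rune
for i := range longString {
    // Cut before the previous rune once including it would exceed 10000 bytes.
    if i-lastIndex > 10000 {
        slices = append(slices, longString[lastIndex:lastI])
        lastIndex = lastI
    }
    lastI = i
}
// The bytes from lastIndex onward still have to be appended after the loop.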

A working example is at play.golang.org; it also takes care of the leftover bytes at the end of the string.
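The linked example is not reproduced here, so the following is only a sketch of what a complete version of the loop might look like, including the leftover handling. The function name splitByRange and the limit parameter are assumptions for illustration, and the limit is assumed to be at least 4 bytes (the longest UTF-8 encoding of a rune).

```go
package main

import "fmt"

// splitByRange is a hypothetical helper that ranges over the runes of s and
// cuts it into chunks of at most limit bytes, never inside a rune.
// Assumption: limit is at least 4, the maximum width of a UTF-8 encoded rune.
func splitByRange(s string, limit int) []string {
	var chunks []string
	start := 0 // byte offset where the current chunk begins
	prev := 0  // byte offset of the most recently visited rune
	for i := range s {
		// i-start is the size of the chunk if it included the rune at prev.
		if i-start > limit {
			chunks = append(chunks, s[start:prev])
			start = prev
		}
		prev = i
	}
	// The final rune can still push the tail over the limit.
	if len(s)-start > limit {
		chunks = append(chunks, s[start:prev])
		start = prev
	}
	// Leftover bytes at the end of the string.
	if start < len(s) {
		chunks = append(chunks, s[start:])
	}
	return chunks
}

func main() {
	fmt.Printf("%q\n", splitByRange("héllo, 世界", 4))
}
```

Unlike the RuneStart approach, this visits every rune in the string, which is the trade-off the answer describes.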

Johan Wikström answered Oct 02 '22 21:10