I always seem to be converting strings to []byte and back to string again, over and over. Is there a lot of overhead with this? Is there a better way?
For example, here is a function that accepts a UTF-8 string, normalizes it, removes accents, then converts special characters to their ASCII equivalents:
var transliterations = map[rune]string{
	'Æ': "AE", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th",
	'ß': "ss", 'æ': "ae", 'ð': "d", 'ł': "l", 'ø': "oe",
	'þ': "th", 'Œ': "OE", 'œ': "oe",
}

func RemoveAccents(s string) string {
	b := make([]byte, len(s))
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
	_, _, e := t.Transform(b, []byte(s), true)
	if e != nil {
		panic(e)
	}
	r := string(b)

	var f bytes.Buffer
	for _, c := range r {
		temp := rune(c)
		if val, ok := transliterations[temp]; ok {
			f.WriteString(val)
		} else {
			f.WriteRune(temp)
		}
	}
	return f.String()
}
So I'm starting with a string because that's what I get, then I'm converting it to a byte slice, then back to a string, then to a byte slice again, then back to a string again. Surely this is unnecessary, but I can't figure out how to avoid it. And does it really have a lot of overhead, or do I not have to worry about slowing things down with excessive conversions?
(Also, if anyone has the time: I've not yet figured out how bytes.Buffer actually works. Would it not be better to initialize a buffer of 2x the size of the string, which is the maximum output size of the return value?)
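Something like this is what I have in mind, as a sketch. I'm assuming Grow and wrapping a slice with spare capacity are the right tools, and the 2*len(s) sizing is just my worst-case estimate, not anything bytes.Buffer requires:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	s := "test stringß"

	// Reserve capacity on a zero-value Buffer with Grow...
	var f bytes.Buffer
	f.Grow(2 * len(s)) // 2x is my worst-case output estimate
	f.WriteString(s)

	// ...or wrap an empty slice that already has the capacity.
	g := bytes.NewBuffer(make([]byte, 0, 2*len(s)))
	g.WriteString(s)

	fmt.Println(f.String() == g.String()) // true
}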
In Go, strings are immutable, so any change creates a new string. As a general rule, convert from a string to a byte or rune slice once, and convert back to a string once. To avoid reallocations of small, transient buffers, over-allocate to give yourself a safety margin when you don't know the exact output size.
For example,
package main

import (
	"bytes"
	"fmt"
	"unicode"
	"unicode/utf8"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

var isMn = func(r rune) bool {
	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

var transliterations = map[rune]string{
	'Æ': "AE", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th",
	'ß': "ss", 'æ': "ae", 'ð': "d", 'ł': "l", 'ø': "oe",
	'þ': "th", 'Œ': "OE", 'œ': "oe",
}

func RemoveAccents(b []byte) ([]byte, error) {
	// Over-allocate by 25% as a safety margin for NFD expansion; if it is
	// still too small, Transform reports an error rather than truncating.
	mnBuf := make([]byte, len(b)*125/100)
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
	n, _, err := t.Transform(mnBuf, b, true)
	if err != nil {
		return nil, err
	}
	mnBuf = mnBuf[:n]
	// Same 25% margin for the transliteration pass.
	tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*125/100))
	for i, w := 0, 0; i < len(mnBuf); i += w {
		r, width := utf8.DecodeRune(mnBuf[i:])
		if s, ok := transliterations[r]; ok {
			tlBuf.WriteString(s)
		} else {
			tlBuf.WriteRune(r)
		}
		w = width
	}
	return tlBuf.Bytes(), nil
}

func main() {
	in := "test stringß"
	fmt.Println(in)
	inBytes := []byte(in)
	outBytes, err := RemoveAccents(inBytes)
	if err != nil {
		fmt.Println(err)
		return
	}
	out := string(outBytes)
	fmt.Println(out)
}
Output:
test stringß
test stringss
There is no general answer to this question. If these conversions are a performance bottleneck in your application, you should fix them. If not: not.
Did you profile your application under realistic load, and is RemoveAccents the bottleneck? No? So why bother?
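If you do want numbers, a minimal benchmark sketch (assuming the []byte-based RemoveAccents from the answer above; the input string is made up). Put it in a _test.go file next to the function and run go test -bench=.:

package main

import "testing"

// BenchmarkRemoveAccents reports ns/op for the conversion-heavy path,
// which is the only honest way to decide whether it matters.
func BenchmarkRemoveAccents(b *testing.B) {
	in := []byte("Ødegård Æbleskiver straße")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := RemoveAccents(in); err != nil {
			b.Fatal(err)
		}
	}
}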
Really: I assume one could do better (in the sense of less garbage, fewer iterations, and fewer conversions), e.g. by chaining in some "TransliterationTransformer", but I doubt it would be worth the hassle. A rough sketch of that idea follows.
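Here is one way that chaining idea could look: a custom transform.Transformer that does the transliteration, chained directly into the normalization pipeline so everything runs in a single pass with one string conversion at each end. This is an illustrative sketch under stated assumptions (the transliterations table from above, the maintained golang.org/x/text packages, and runes.Remove in place of the deprecated transform.RemoveFunc), not a tuned implementation:

package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

// Same table as above (abbreviated here).
var transliterations = map[rune]string{
	'Æ': "AE", 'Ø': "OE", 'ß': "ss", 'æ': "ae", 'ø': "oe",
}

// transliterator streams src to dst, expanding any rune found in the
// transliterations table and copying everything else through unchanged.
type transliterator struct{ transform.NopResetter }

func (transliterator) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		r, w := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && !atEOF && !utf8.FullRune(src[nSrc:]) {
			return nDst, nSrc, transform.ErrShortSrc // rune split across chunks
		}
		rep, ok := transliterations[r]
		if !ok {
			rep = string(src[nSrc : nSrc+w]) // pass the rune through as-is
		}
		if nDst+len(rep) > len(dst) {
			return nDst, nSrc, transform.ErrShortDst // caller grows dst, retries
		}
		nDst += copy(dst[nDst:], rep)
		nSrc += w
	}
	return nDst, nSrc, nil
}

func main() {
	t := transform.Chain(
		norm.NFD,
		runes.Remove(runes.In(unicode.Mn)), // drop nonspacing marks (accents)
		transliterator{},
		norm.NFC,
	)
	// transform.String does the []byte round trip internally, exactly once.
	out, _, err := transform.String(t, "test stringß")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(out) // test stringss
}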