I am trying to write a function to truncate strings with special characters in Go. One example is below:
"H㐀〾▓朗퐭텟şüöžåйкл¤"
However, I truncate the string at the maximum number of characters allowed, and the cut lands in the middle of a character. This corrupts the data.
The result comes out like this:
H㐀〾▓朗퐭텟şüöžå�...
The � should not be there. How do we detect these special characters and split the string based on the length of these characters?
package main

import (
	"fmt"
	"regexp"
)

var reNameBlacklist = regexp.MustCompile(`(&|>|<|\/|:|\n|\r)*`)
var maxFileNameLength = 30

// SanitizeName sanitizes user names in an email
func SanitizeName(name string, limit int) string {
	result := name
	reNameBlacklist.ReplaceAllString(result, "")
	if len(result) > limit {
		result = result[:limit] + "..."
	}
	return result
}

func main() {
	str := "H㐀〾▓朗퐭텟şüöžåйкл¤"
	fmt.Println(str)
	strsan := SanitizeName(str, maxFileNameLength)
	fmt.Println(strsan)
}
Slicing a string treats it as its underlying byte sequence; the slice operator works on byte indexes, not rune indexes (a rune can be several bytes long). However, range over a string iterates over runes, while the index it yields is a byte offset. This makes it fairly straightforward to do what you're looking for:
func SanitizeName(name string, limit int) string {
	name = reNameBlacklist.ReplaceAllString(name, "")
	result := name
	chars := 0
	for i := range name {
		if chars >= limit {
			result = name[:i]
			break
		}
		chars++
	}
	return result
}
This is explained in further detail on the Go blog.
Update:
As commenters below suggest, you can normalize arbitrary UTF-8 to NFC (Normalization Form C, canonical composition), which combines some multi-rune forms, such as a base letter followed by a combining diacritic, into single-rune forms where possible. This adds a single step using golang.org/x/text/unicode/norm. Playground example of this here: https://play.golang.org/p/93qxI11km2f
func SanitizeName(name string, limit int) string {
	name = norm.NFC.String(name)
	name = reNameBlacklist.ReplaceAllString(name, "")
	result := name
	chars := 0
	for i := range name {
		if chars >= limit {
			result = name[:i]
			break
		}
		chars++
	}
	return result
}
Your data is getting corrupted because some characters take more than one byte in UTF-8 and you are splitting them in the middle. To avoid this, Go has the type rune, which represents a Unicode code point. You can convert the string to a []rune like this:
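func SanitizeName(name string, limit int) string {
	// Remove the special chars here. ReplaceAllString returns a new
	// string, so the result has to be assigned back.
	name = reNameBlacklist.ReplaceAllString(name, "")
	result := []rune(name)
	// Slicing past the end of the slice would panic, so return short names as-is.
	if len(result) <= limit {
		return name
	}
	return string(result[:limit])
}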
This keeps only the first limit runes (Unicode code points) of the string.
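For completeness, here is a minimal self-contained sketch of the []rune approach that also appends the "..." suffix from the question. The function name truncateWithEllipsis is hypothetical, and treating a negative limit as zero is an assumption made here rather than something the question specifies.

```go
package main

import "fmt"

// truncateWithEllipsis (hypothetical helper) keeps at most limit runes
// of s and appends "..." only when something was actually cut off.
func truncateWithEllipsis(s string, limit int) string {
	if limit < 0 {
		limit = 0 // assumption: treat negative limits as zero
	}
	runes := []rune(s)
	if len(runes) <= limit {
		return s // short enough already; nothing to truncate
	}
	return string(runes[:limit]) + "..."
}

func main() {
	fmt.Println(truncateWithEllipsis("H\u0439\u00e5", 2)) // cut after the 2nd rune
	fmt.Println(truncateWithEllipsis("abc", 5))           // returned unchanged
}
```

Converting to []rune copies the whole string, so for very long inputs the range-based loop may be the cheaper option.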