Golang truncate strings with special characters without corrupting data

Question

I am trying to write a function that truncates strings containing special characters in Go. One example is below:

"H㐀〾▓朗퐭텟şüöžåйкл¤"

However, I truncate based on the number of characters allowed, and the cut lands in the middle of a character. This corrupts the data.

The result comes out like

H㐀〾▓朗퐭텟şüöžå�...

The � should not be there. How can I detect these special characters and split the string based on their length?

package main

import (
    "fmt"
    "regexp"
)

var reNameBlacklist = regexp.MustCompile(`(&|>|<|/|:|\n|\r)*`)
var maxFileNameLength = 30

// SanitizeName sanitizes user names in an email
func SanitizeName(name string, limit int) string {

    result := name
    reNameBlacklist.ReplaceAllString(result, "")
    if len(result) > limit {
        result = result[:limit] + "..."
    }
    return result
}



func main() {
    str := "H㐀〾▓朗퐭텟şüöžåйкл¤"
    fmt.Println(str)

    strsan := SanitizeName(str, maxFileNameLength)
    fmt.Println(strsan)

}

Adrian · Accepted Answer

Slicing a string operates on its underlying bytes: the slice operator uses byte indexes, not rune indexes, and a single rune can occupy several bytes. A range loop over a string, however, iterates rune by rune, and the index it yields is the byte offset at which each rune starts. That makes it fairly straightforward to do what you're looking for:

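func SanitizeName(name string, limit int) string {
    // ReplaceAllString returns a new string, so the result must be assigned.
    name = reNameBlacklist.ReplaceAllString(name, "")
    result := name
    chars := 0
    // i is the byte offset at which the current rune starts.
    for i := range name {
        if chars >= limit {
            // limit runes have been counted; cut at this rune boundary.
            result = name[:i]
            break
        }
        chars++
    }
    return result
}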

This is explained in further detail on the Go blog, in the post "Strings, bytes, runes and characters in Go".
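To make the byte-versus-rune distinction concrete, here is a minimal sketch using only the standard library. The helper name runeOffsets and the sample string are illustrative and do not come from the answer above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each rune of s starts.
// (Hypothetical helper, only for illustration.)
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // range yields byte offsets, one per rune
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "Hşü" // H is 1 byte in UTF-8; ş and ü are 2 bytes each

	fmt.Println(len(s))                    // 5 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 3 (runes)
	fmt.Println(runeOffsets(s))            // [0 1 3]

	// Cutting at byte 2 lands inside ş, which produces invalid UTF-8.
	fmt.Println(utf8.ValidString(s[:2])) // false
	// Cutting at a rune start keeps the prefix valid.
	fmt.Println(utf8.ValidString(s[:3])) // true
}
```

Any offset returned by the range loop is a safe place to slice, which is what the SanitizeName loop above relies on.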

Update:

As the commenters suggest, you can normalize arbitrary UTF-8 text to NFC (Normalization Form Canonical Composition), which combines some multi-rune sequences, such as a letter followed by a combining diacritic, into a single rune where possible. This adds a single step using golang.org/x/text/unicode/norm. Playground example: https://play.golang.org/p/93qxI11km2f

func SanitizeName(name string, limit int) string {
    // Requires the import "golang.org/x/text/unicode/norm".
    name = norm.NFC.String(name)
    name = reNameBlacklist.ReplaceAllString(name, "")
    result := name
    chars := 0
    for i := range name {
        if chars >= limit {
            result = name[:i]
            break
        }
        chars++
    }
    return result
}
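The sketch below illustrates why normalization matters, using only the standard library so that it runs without extra modules. It does not call norm.NFC; it only compares a decomposed and a precomposed spelling of "é". The helper name truncateRunes is hypothetical and mirrors the loop above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes keeps at most limit runes of s, cutting only at rune
// boundaries. (Hypothetical helper, mirroring the loop in SanitizeName.)
func truncateRunes(s string, limit int) string {
	count := 0
	for i := range s {
		if count >= limit {
			return s[:i]
		}
		count++
	}
	return s
}

func main() {
	decomposed := "e\u0301" // "e" followed by U+0301 COMBINING ACUTE ACCENT
	composed := "\u00e9"    // precomposed "é", the NFC form of the sequence above

	fmt.Println(utf8.RuneCountInString(decomposed)) // 2
	fmt.Println(utf8.RuneCountInString(composed))   // 1

	// Truncating the decomposed form to one rune silently drops the accent.
	fmt.Printf("%q\n", truncateRunes(decomposed, 1)) // "e"
	// The precomposed form survives the same truncation intact.
	fmt.Printf("%q\n", truncateRunes(composed, 1)) // "é"
}
```

NFC only helps where a precomposed code point exists. Sequences with no single-rune form can still be split by rune-based truncation.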

Topo · Answer

Your data is getting corrupted because some characters occupy more than one byte and you are splitting them. To avoid this, Go has the type rune, which represents a Unicode code point. You can convert the string to a []rune like this:

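func SanitizeName(name string, limit int) string {
    // Remove the special chars here; ReplaceAllString returns a new string.
    name = reNameBlacklist.ReplaceAllString(name, "")
    result := []rune(name)
    // Guard against slicing past the end when the name is already short enough.
    if len(result) <= limit {
        return name
    }
    return string(result[:limit])
}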

This keeps only the first limit runes (Unicode code points) of the name.
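The question's code also appended "..." after the cut. A minimal sketch of the []rune approach with that suffix and a bounds check might look like the following. The name truncateWithEllipsis is hypothetical, and counting the limit in runes without the suffix is an assumption about the intended behavior.

```go
package main

import "fmt"

// truncateWithEllipsis keeps at most limit runes of s and appends "..."
// only when something was actually cut off.
// (Hypothetical helper; the limit does not include the three dots.)
func truncateWithEllipsis(s string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	runes := []rune(s)
	if len(runes) <= limit {
		return s // short enough: return unchanged, no ellipsis
	}
	return string(runes[:limit]) + "..."
}

func main() {
	fmt.Println(truncateWithEllipsis("Hşüöžåйкл¤", 5)) // Hşüöž...
	fmt.Println(truncateWithEllipsis("Go", 5))         // Go
}
```

Converting to []rune allocates a copy of the whole string, so for very long inputs the range-based loop from the accepted answer may be the cheaper option.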

Tags:

utf-8

go

special-characters

Sakib
