Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove invalid UTF-8 characters from a string

Tags:

json

unicode

go

I get this on json.Marshal of a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete/replace such strings in Go? I've been reading docst on unicode and unicode/utf8 packages and there seems no obvious/quick way to do it.

In Python for example you have methods for it where the invalid characters can be deleted, replaced by a specified character or strict setting which raises exception on invalid chars. How can I do equivalent thing in Go?

UPDATE: I meant the reason for getting an exception (panic?) - illegal char in what json.Marshal expects to be valid UTF-8 string.

(how the illegal byte sequence got into that string is not important, the usual way - bugs, file corruption, other programs that do not conform to unicode, etc)

like image 773
LetMeSOThat4U Avatar asked Dec 05 '13 13:12

LetMeSOThat4U


People also ask

How can I change a non UTF-8 character from a text file?

Go to File > Reopen with Encoding > UTF-8. Copy the entire content of the file into a new file and save it. May not be the expected solution but putting this out here in case it helps anyone, since I've been struggling for hours with this.

How do you remove a non UTF-8 character from a text file in Unix?

To automatically find and delete non-UTF-8 characters, we're going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.

What is an invalid UTF-8 string?

This error is created when the uploaded file is not in a UTF-8 format. UTF-8 is the dominant character encoding format on the World Wide Web. This error occurs because the software you are using saves the file in a different type of encoding, such as ISO-8859, instead of UTF-8.


2 Answers

In Go 1.13+, you can do this:

strings.ToValidUTF8("a\xc5z", "")

In Go 1.11+, it's also very easy to do the same using the Map function and utf8.RuneError like this:

fixUtf := func(r rune) rune {
    if r == utf8.RuneError {
        return -1
    }
    return r
}

fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))

Output:

az
posico

Playground: Here.

like image 137
Inanc Gumus Avatar answered Sep 17 '22 16:09

Inanc Gumus


For example,

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "a\xc5z"
    fmt.Printf("%q\n", s)
    if !utf8.ValidString(s) {
        v := make([]rune, 0, len(s))
        for i, r := range s {
            if r == utf8.RuneError {
                _, size := utf8.DecodeRuneInString(s[i:])
                if size == 1 {
                    continue
                }
            }
            v = append(v, r)
        }
        s = string(v)
    }
    fmt.Printf("%q\n", s)
}

Output:

"a\xc5z"
"az"

Unicode Standard

FAQ - UTF-8, UTF-16, UTF-32 & BOM

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx2 must be followed with a byte of the form 10xxxxxx2. A sequence such as <110xxxxx2 0xxxxxxx2> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx2 as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx2.

A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.

like image 31
peterSO Avatar answered Sep 17 '22 16:09

peterSO