I get this on json.Marshal of a list of strings:
json: invalid UTF-8 in string: "...ole\xc5\"
The reason is obvious, but how can I delete/replace such strings in Go? I've been reading docst on unicode
and unicode/utf8
packages and there seems no obvious/quick way to do it.
In Python for example you have methods for it where the invalid characters can be deleted, replaced by a specified character or strict setting which raises exception on invalid chars. How can I do equivalent thing in Go?
UPDATE: I meant the reason for getting an exception (panic?) - illegal char in what json.Marshal expects to be valid UTF-8 string.
(how the illegal byte sequence got into that string is not important, the usual way - bugs, file corruption, other programs that do not conform to unicode, etc)
Go to File > Reopen with Encoding > UTF-8. Copy the entire content of the file into a new file and save it. May not be the expected solution but putting this out here in case it helps anyone, since I've been struggling for hours with this.
To automatically find and delete non-UTF-8 characters, we're going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.
This error is created when the uploaded file is not in a UTF-8 format. UTF-8 is the dominant character encoding format on the World Wide Web. This error occurs because the software you are using saves the file in a different type of encoding, such as ISO-8859, instead of UTF-8.
In Go 1.13+, you can do this:
strings.ToValidUTF8("a\xc5z", "")
In Go 1.11+, it's also very easy to do the same using the Map function and utf8.RuneError like this:
fixUtf := func(r rune) rune {
if r == utf8.RuneError {
return -1
}
return r
}
fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))
Output:
az
posico
Playground: Here.
For example,
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
s := "a\xc5z"
fmt.Printf("%q\n", s)
if !utf8.ValidString(s) {
v := make([]rune, 0, len(s))
for i, r := range s {
if r == utf8.RuneError {
_, size := utf8.DecodeRuneInString(s[i:])
if size == 1 {
continue
}
}
v = append(v, r)
}
s = string(v)
}
fmt.Printf("%q\n", s)
}
Output:
"a\xc5z"
"az"
Unicode Standard
FAQ - UTF-8, UTF-16, UTF-32 & BOM
Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx2 must be followed with a byte of the form 10xxxxxx2. A sequence such as <110xxxxx2 0xxxxxxx2> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx2 as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx2.
A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With