Convert unicode code point to literal character in Go

Tags: unicode, go

Let's say I have a text file like this.

\u0053
\u0075
\u006E

Is there a way I can convert that to this?

S
u
n

Currently, I'm using ioutil.ReadFile("data.txt"), but when I print the data, I get the Unicode escape sequences instead of the characters they stand for. I realize this is the correct behavior for ReadFile; it's just not what I want.

I'm aiming for a substitution of the code points with their literal characters.

425nesp asked Dec 07 '15 04:12


2 Answers

You can use the strconv.Unquote() and strconv.UnquoteChar() functions to do the conversion.

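Note that strconv.Unquote() only accepts input that is itself quoted (that is, it starts and ends with a double quote " or a back quote `), so the quotes have to be added manually. Use double quotes for this: a back-quoted string is treated as a raw string, so escape sequences inside it are left untouched.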

See this example:

lines := []string{
    `\u0053`,
    `\u0075`,
    `\u006E`,
}
fmt.Println(lines)

for i, v := range lines {
    var err error
    lines[i], err = strconv.Unquote(`"` + v + `"`)
    if err != nil {
        fmt.Println(err)
    }
}
fmt.Println(lines)

fmt.Println(strconv.Unquote(`"Go\u0070\x68\x65\x72"`))

Output:

[\u0053 \u0075 \u006E]
[S u n]
Gopher <nil>

If the input contains the escape sequence of a single rune (or you only want to decode the first rune), you can use strconv.UnquoteChar(). Unlike strconv.Unquote(), it does not need the input to be wrapped in quotes:

runes := []string{
    `\u0053`,
    `\u0075`,
    `\u006E`,
}
fmt.Println(runes)

for _, v := range runes {
    // UnquoteChar decodes the first character or escape sequence in v.
    value, _, _, err := strconv.UnquoteChar(v, 0)
    if err != nil {
        fmt.Println(err)
        continue
    }
    fmt.Printf("%c\n", value)
}

This will output:

[\u0053 \u0075 \u006E]
S
u
n
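
strconv.UnquoteChar() also returns the unparsed remainder of its input (the third return value), so it can be called in a loop to decode a string with several escape sequences without adding quotes first. The following is a minimal sketch of that; decodeEscapes is a hypothetical helper name, not part of strconv.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// decodeEscapes decodes every escape sequence in s by calling
// strconv.UnquoteChar repeatedly on the remaining tail.
// The function name is hypothetical.
func decodeEscapes(s string) (string, error) {
	var b strings.Builder
	for len(s) > 0 {
		// quote is 0, so both ' and " may appear unescaped in s.
		r, multibyte, tail, err := strconv.UnquoteChar(s, 0)
		if err != nil {
			return "", err
		}
		if r < utf8.RuneSelf || !multibyte {
			// ASCII characters and \x escapes stand for a single byte.
			b.WriteByte(byte(r))
		} else {
			b.WriteRune(r)
		}
		s = tail
	}
	return b.String(), nil
}

func main() {
	fmt.Println(decodeEscapes(`Go\u0070\x68\x65\x72`)) // prints "Gopher <nil>"
}
```

Because the quote argument is 0, a literal " in the input is copied through unchanged rather than causing an error.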
icza answered Sep 30 '22 11:09


A slightly different approach is to use strconv.ParseInt. It generates less garbage and involves less internal logic (Unquote performs many other checks) when parsing the lines:

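for i, v := range lines {
    // Only handle lines of the exact form \uXXXX.
    if len(v) != 6 || v[:2] != `\u` {
        continue
    }

    // Parse the four hex digits as a code point, then convert it to a string.
    if r, err := strconv.ParseInt(v[2:], 16, 32); err == nil {
        lines[i] = string(rune(r))
    }
}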
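
If the input is not trusted, the parsed number can be validated before it is converted. The sketch below is one way to do that, under the assumption that every line has the exact form \uXXXX; parseCodePoint is a hypothetical helper. It uses strconv.ParseUint so that a sign is rejected, and utf8.ValidRune so that surrogate halves such as \uD800 are reported as errors instead of silently becoming U+FFFD.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// parseCodePoint converts a line of the form \uXXXX into its character.
// It is a hypothetical helper, not part of strconv.
func parseCodePoint(line string) (string, error) {
	if len(line) != 6 || !strings.HasPrefix(line, `\u`) {
		return "", fmt.Errorf("not a \\uXXXX escape: %q", line)
	}
	// ParseUint rejects a leading sign, which ParseInt would accept.
	n, err := strconv.ParseUint(line[2:], 16, 32)
	if err != nil {
		return "", err
	}
	r := rune(n)
	if !utf8.ValidRune(r) {
		// Surrogate halves are not valid code points on their own.
		return "", fmt.Errorf("invalid code point U+%04X", n)
	}
	return string(r), nil
}

func main() {
	for _, line := range []string{`\u0053`, `\u0075`, `\u006E`} {
		s, err := parseCodePoint(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s)
	}
}
```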

OneOfOne answered Sep 30 '22 11:09