Golang Decoding/Unmarshaling invalid unicode in JSON

Tags: json, unicode, go

I am fetching JSON files in Go that are not formatted consistently. For example, I can have the following:

{"email": "\"[email protected]\""}
{"email": "[email protected]"}
{"name": "m\303\203ead"}

We can see that the backslash escape sequences in the last one will be a problem. Decoding with a json.Decoder, given:

{"name": "m\303\203ead"}

I get the error: invalid character '3' in string escape code

I have tried several approaches to normalise my data, for example going through a string array (it works, but there are too many edge cases), or even filtering out the escape characters.

Finally, I came across this article (http://blog.golang.org/normalization), and the solution it proposes seemed very interesting.

I have tried the following:

isMn := func(r rune) bool {
    return unicode.Is(unicode.Mn, r)
}

t := transform.Chain(norm.NFC, transform.RemoveFunc(isMn), norm.NFD)

fileReader, err := bucket.GetReader(filename)
transformReader := transform.NewReader(fileReader, t)
decoder := json.NewDecoder(transformReader)

for {
    var dataModel Model
    if err := decoder.Decode(&dataModel); err == io.EOF {
        break
    } else {
        // DO SOMETHING
    }
}

With Model being:

type Model struct {
    Name  string `json:"name" bson:"name"`
    Email string `json:"email" bson:"email"` 
}

I have tried several variations of it, but haven't been able to get it working.

So my question is: how can I easily handle decoding/unmarshaling JSON data with different encodings, knowing that I have no control over those JSON files?

If you are reading this, thank you anyway.

Antobiotics asked Jun 18 '14 19:06


1 Answer

You can use json.RawMessage instead of string; that way json.Decode won't try to decode the invalid characters.

Playground: http://play.golang.org/p/fB-38KGAO0

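package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

type Model struct {
    // RawMessage keeps the bytes of the value as they appear in the input.
    N  json.RawMessage `json:"name" bson:"name"`
}

// Name returns the raw JSON value, including its surrounding quotes.
func (m *Model) Name() string {
    return string(m.N)
}

func main() {
    s := "{\"name\": \"m\303\203ead\"}"
    r := strings.NewReader(s)
    d := json.NewDecoder(r)
    m := Model{}

    fmt.Println(d.Decode(&m))
    fmt.Println(m.Name())
}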
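
One thing that may be worth checking with the snippet above: in a Go interpreted string literal, \303\203 is a pair of octal byte escapes, so the string holds the bytes 0xC3 0x83 (the UTF-8 encoding of "Ã") rather than backslashes and digits. The sketch below assumes the files contain the literal backslash sequences shown in the question, and feeds both forms to the same decoder. decodeName is a hypothetical helper, not part of the original answer. As far as I can tell, the decoder checks the syntax of the whole value before it fills any field, so the raw form is expected to fail even with json.RawMessage.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Model struct {
	N json.RawMessage `json:"name"`
}

// decodeName (hypothetical helper) decodes one JSON object and returns
// the raw bytes of its "name" value together with the decoder's error.
func decodeName(s string) (string, error) {
	var m Model
	err := json.NewDecoder(strings.NewReader(s)).Decode(&m)
	return string(m.N), err
}

func main() {
	inputs := []string{
		// Interpreted literal: \303\203 becomes the bytes 0xC3 0x83.
		"{\"name\": \"m\303\203ead\"}",
		// Raw literal: the backslashes and digits stay in the data.
		`{"name": "m\303\203ead"}`,
	}
	for _, in := range inputs {
		name, err := decodeName(in)
		fmt.Printf("input=%q name=%s err=%v\n", in, name, err)
	}
}
```

If the second input reports an error on your data as well, the backslash sequences need to be rewritten before the decoder sees them.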

Edit: Well, you can use a regex, though I'm not sure how viable that is for you (http://play.golang.org/p/VYJKTKmiYm):

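// cleanUp rewrites each backslash followed by three digits (at a word
// boundary) as a \u0ddd escape, which the JSON scanner accepts.
func cleanUp(s string) string {
    re := regexp.MustCompile(`\b(\\\d\d\d)`)
    return re.ReplaceAllStringFunc(s, func(s string) string {
        return `\u0` + s[1:]
    })
}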
func main() {
    s := "{\"name\": \"m\303\203ead\"}"
    s = cleanUp(s)
    r := strings.NewReader(s)
    d := json.NewDecoder(r)
    m := Model{}
    fmt.Println(d.Decode(&m))
    fmt.Println(m.Name())
}
OneOfOne answered Sep 30 '22 20:09