
How to decode json string as UTF-8?

I've been working with JSON for some time, and my problem is that the strings I decode come out as Latin-1; I cannot get them decoded as UTF-8. Because of that, some characters are shown incorrectly (e.g. ' shown as ').

I've read a few questions here on Stack Overflow, but the answers don't seem to work.

The JSON structure I'm working with looks like this (it is from the YouTube API):

...
"items": [
  {
   ...
   "snippet": {
    ...
    "title": "Powerbeats Pro “Totally Wireless” Except when you need a wire",
    ...
    }
   }
  ]

I fetch and decode it with:

response = await http.get(link, headers: {HttpHeaders.contentTypeHeader: "application/json; charset=utf-8"});
extractedData = json.decode(response.body);
dataTech = extractedData["items"];

And then what I tried was changing the second line to:

extractedData = json.decode(utf8.decode(response.body));

But this gave me an error about wrong format. So I changed it to:

extractedData = json.decode(utf8.decode(response.bodyBytes));

This doesn't throw the error, but it doesn't fix the problem either. Playing around with the headers doesn't help either.

I would like the data to be stored in dataTech as they are now, but encoded as UTF-8. What am I doing wrong?

Karol Wasowski asked Apr 26 '19 09:04


2 Answers

Just an aside first: UTF-8 is typically an external format, and typically represented by an array of bytes. It's what you might send over the network as part of an HTTP response. Internally, Dart stores strings as sequences of UTF-16 code units. The utf8 encoder/decoder converts between internal format strings and external format arrays of bytes.

This is why you are using utf8.decode(response.bodyBytes): it takes the raw body bytes and converts them to an internal string. (response.body basically does this too, but it chooses the bytes->string decoder based on the charset in the response's Content-Type header. When that charset is missing (as it often is), the http package picks Latin-1, which gives the wrong result if the response is actually in a different charset.) By calling utf8.decode yourself, you are overriding the (potentially wrong) choice made by http, because you know that this particular server always sends UTF-8. (It may not, of course!)
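To illustrate the difference, here is a minimal sketch that needs only dart:convert and no network access. The helper name decodeUtf8Json and the sample payload are made up for this example; the bytes stand in for what response.bodyBytes would hold if the server sent UTF-8.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes raw body bytes as UTF-8, then parses the JSON.
/// Assumes the top-level JSON value is an object.
Map<String, dynamic> decodeUtf8Json(List<int> bodyBytes) {
  final text = utf8.decode(bodyBytes);
  return json.decode(text) as Map<String, dynamic>;
}

void main() {
  // Sample payload with curly quotes (U+201C and U+201D), like the title
  // in the question.
  const original = '{"title": "\u201CTotally Wireless\u201D"}';

  // Stand-in for response.bodyBytes: the UTF-8 bytes a server would send.
  final bytes = utf8.encode(original);

  // Decoding the same bytes as Latin-1 turns each curly quote into three
  // unrelated characters, so the result no longer matches the original.
  final asLatin1 = latin1.decode(bytes);
  print(asLatin1 == original); // false

  // Decoding as UTF-8 restores the original text.
  final asUtf8 = utf8.decode(bytes);
  print(asUtf8 == original); // true

  print(decodeUtf8Json(bytes)['title']);
}
```

If the title printed by the last line looks right in the console, the decoding step is fine and the problem is more likely in how the string is displayed later.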

Another aside: setting a content type header on a request is rarely useful. You typically aren't sending any content - so it doesn't have a type! And that doesn't influence the content type or content type charset that the server will send back to you. The accept header might be what you are looking for. That's a hint to the server of what type of content you'd like back - but not all servers respect it.

So why are your special characters still incorrect? Try printing utf8.decode(response.bodyBytes) before passing it to json.decode. Does it look right in the console? (It's very useful to create a simple Dart command line application for this type of issue; I find it easier to set breakpoints and inspect variables in a simple ten line Dart app.) Try using something like Wireshark to capture the bytes on the wire (again, it is useful to have the simple Dart app for this). Or try using Postman to send the same request and inspect the response.
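One way to do that kind of inspection in a small command line app is to print the code point of every character, which does not depend on the console font. This is only a sketch: describeRunes is a hypothetical helper, and the sample bytes stand in for response.bodyBytes.

```dart
import 'dart:convert';

/// Hypothetical helper: lists each character of [text] together with its
/// Unicode code point in hex, for example "H U+0048".
List<String> describeRunes(String text) {
  return text.runes.map((rune) {
    final hex = rune.toRadixString(16).toUpperCase().padLeft(4, '0');
    return '${String.fromCharCode(rune)} U+$hex';
  }).toList();
}

void main() {
  // Stand-in for response.bodyBytes: UTF-8 bytes of a short quoted string.
  final bytes = utf8.encode('\u201CHi\u201D');

  // The raw bytes, as a capture tool would show them.
  print(bytes);

  // Correct decode: four characters, with U+201C and U+201D at the ends.
  print(describeRunes(utf8.decode(bytes)));

  // Wrong decode: each quote is split into three Latin-1 characters,
  // so the list has eight entries.
  print(describeRunes(latin1.decode(bytes)));
}
```

If the correctly decoded list shows the expected code points but the screen still shows the wrong glyphs, the data is fine and the display side (font or widget) is the place to look.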

How are you trying to show the characters? It may simply be that the font you are using doesn't have them.

Richard Heap answered Sep 21 '22 09:09


Just add the header 'Accept': 'application/json; charset=UTF-8' to the request. It worked for me.

Yasser Benmman answered Sep 21 '22 09:09