The correct way to load and read a JSON file that contains special characters in Python

Question

I'm working with a JSON file that contains strings in an unknown encoding, such as the example below:

"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"

I loaded this text with the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the string I expect (in this case, it should be Lê Nguyễn Phú).

My question is: which encoding was used, and how do I parse this text properly in Python?

The JSON file comes from an external source that I don't control, so I cannot inspect or change how the text was encoded.

[Updated] More details:

The JSON file looks like this:

{
 "content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}

First, I loaded the JSON file:

import json

with open(json_path, 'r') as f:
    data = json.load(f)

But when I extract the content, it's not what I expected:

string = data.get('content', '')
print(string)

'LÃª Nguyá»\x85n PhÃº'

hobbs · Accepted Answer

Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like

json.loads(in_string).encode("latin_1").decode("utf_8")

This decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
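As a minimal sketch of those three steps, applied to the sample from the question (the variable names are only illustrative, and the last line assumes the file was produced the way described above, which the source of the file has not confirmed):

```python
import json

# The JSON document from the question. A raw string keeps the \u escapes
# intact so that json.loads, not Python, interprets them.
raw_json = r'{"content": "L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"}'

# Step 1: parse the JSON. Each \u00XX escape becomes one character
# in the range U+0000..U+00FF.
garbled = json.loads(raw_json)["content"]

# Step 2: Latin-1 maps each of those characters back to the single
# byte value it stands for.
raw_bytes = garbled.encode("latin_1")

# Step 3: those bytes are the original UTF-8 encoding, so decode them as such.
repaired = raw_bytes.decode("utf_8")
print(repaired)  # Lê Nguyễn Phú

# Going the other way reproduces the garbled JSON string
# (an assumption about how the file was written).
regarbled = json.dumps(repaired.encode("utf_8").decode("latin_1"))
```

Splitting the one-liner into named steps makes it easy to inspect raw_bytes and confirm that it really is valid UTF-8 before trusting the result.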

The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash with a UnicodeEncodeError. But if you try to apply it to a validly-encoded string containing only codepoints up to U+00FF, it will either fail with a UnicodeDecodeError or silently turn your perfectly good string into a different kind of garbage.
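If the input might be a mix of garbled and correct strings, one possible compromise is a best-effort helper that leaves a string alone whenever the round trip fails. The function name fix_mojibake is hypothetical, and this is only a heuristic sketch: as the last case shows, it cannot tell a genuinely intended string apart from a garbled one when both are valid.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo 'UTF-8 bytes read as Latin-1' garbling.

    Returns the input unchanged when the round trip is impossible.
    """
    try:
        return text.encode("latin_1").decode("utf_8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: the text has codepoints above U+00FF,
        # so it cannot be the garbling described above.
        # UnicodeDecodeError: the resulting bytes are not valid UTF-8.
        return text


# Garbled input is repaired.
print(fix_mojibake("L\u00c3\u00aa"))   # Lê

# A correct string with a codepoint above U+00FF is left alone.
print(fix_mojibake("Nguy\u1ec5n"))     # Nguyễn

# A correct Latin-1-range string whose bytes are not valid UTF-8 is left alone.
print(fix_mojibake("caf\u00e9"))       # café

# Ambiguous: if the author really meant the two characters "Ã©",
# the helper still rewrites them to "é".
print(fix_mojibake("\u00c3\u00a9"))    # é
```

Catching both exceptions avoids the crash, but the ambiguous case remains, so a helper like this is only safe when you know where the data came from.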

Tags: python, json, string, python-3.x, unicode

nguyendhn
