
The correct way to load and read a JSON file that contains special characters in Python

I'm working with a JSON file that contains some strings in an unknown encoding, such as the example below:

"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"

I loaded this text with the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the string I expect (in this case, it has to be Lê Nguyễn Phú).

My question is: which encoding method did they use, and how do I parse this text properly in Python?

The JSON file comes from an external source that I don't control, so I cannot know or change how the text was encoded.

[Updated] More details:

The JSON file looks like this:

{
 "content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}

Firstly, I loaded the JSON file:

import json

with open(json_path, 'r') as f:
    data = json.load(f)

But when I extract the content, it's not what I expected:

string = data.get('content', '')
print(string)

'Lê Nguyá»\x85n Phú'
nguyendhn asked Nov 16 '25 03:11



1 Answer

Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like

json.loads(in_string).encode("latin_1").decode("utf_8")

This decodes the string from the JSON, extracts the bytes from it (the 256 characters of Latin-1 correspond one-to-one with the first 256 Unicode code points), and then re-decodes those bytes as UTF-8.
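As a rough illustration of the whole round trip, the sketch below first reproduces the damage (this is my assumption about how the file was produced, based on the escapes in the question) and then reverses it. The variable names are made up for the example.

```python
import json

original = "Lê Nguyễn Phú"

# Reproduce the damage: take the UTF-8 bytes and misread each byte
# as a single Latin-1 character, then serialize that string as JSON.
garbled = original.encode("utf_8").decode("latin_1")
raw_json = json.dumps({"content": garbled})
print(raw_json)

# Reverse it: parse the JSON, map each character back to its byte
# via Latin-1, then decode those bytes as the UTF-8 they really are.
data = json.loads(raw_json)
fixed = data["content"].encode("latin_1").decode("utf_8")
print(fixed)
```

The printed JSON should contain the same \u00c3\u00aa style escapes as the file in the question, which is what suggests this is how the text was garbled.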

The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash with a UnicodeEncodeError. And if you try to apply it to a validly-encoded string containing non-ASCII codepoints up to U+00FF, it will either fail with a UnicodeDecodeError or, when the bytes happen to form valid UTF-8, silently turn your perfectly good string into a different kind of garbage.
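If the input might be a mix of garbled and clean strings, one possible compromise is to attempt the repair and fall back to the original text when it fails. The helper below, fix_mojibake, is a hypothetical name and only a sketch: it catches both failure modes described above, but it cannot detect the silent false positive.

```python
def fix_mojibake(text):
    """Hypothetical helper: try to undo UTF-8-read-as-Latin-1 damage.

    Returns the input unchanged if the repair raises, so clean strings
    usually survive. It is NOT a reliable detector (see last example).
    """
    try:
        return text.encode("latin_1").decode("utf_8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Garbled input from the question: repaired.
repaired = fix_mojibake("L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba")

# Clean input with a codepoint above U+00FF: encode fails, left alone.
untouched_high = fix_mojibake("Lê Nguyễn Phú")

# Clean Latin-1-range input whose bytes are not valid UTF-8: left alone.
untouched_low = fix_mojibake("café")

# Clean input whose bytes happen to be valid UTF-8: silently changed.
false_positive = fix_mojibake("\u00c3\u00a9")
```

The last call is the unavoidable case: a string that genuinely consists of the two characters U+00C3 and U+00A9 comes back as a single "é", so this is only safe when such strings cannot legitimately occur in your data.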

hobbs answered Nov 17 '25 19:11



