I have JSON file which contains followingly encoded strings:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json
module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load()
method is 'HornÃ\xadková'
. The string should be correctly decoded as 'Horníková'
instead.
I read the JSON specification and I understasnd that after \u
there should be 4 hexadecimal numbers specifing Unicode number of character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u
-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type JSON file even valid JSON file according to the specification?
Your text is already encoded and you need to tell this to Python by using a b
prefix in your string but since you're using json and the input needs to be string you have to decode your encoded text manually. Since your input is not byte you can use 'raw_unicode_escape'
encoding to convert the string to byte without encoding and prevent the open
method to use its own default encoding. Then you can simply use aforementioned approach to get the desired result.
Note that since you need to do the encoding and decoding your have to read file content and perform the encoding on loaded string, then you should use json.loads()
instead of json.load()
.
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
...: d = json.loads(f.read().encode('raw_unicode_escape').decode())
...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With