UnicodeDecodeError: 'gbk' codec can't decode byte when reading JSON that contains Chinese

I'm switching from Python 2 to Python 3.

In my Jupyter notebook, the code is:

import json

file = "./data/test.json"
with open(file) as data_file:
    data = json.load(data_file)

This worked fine in Python 2, but after switching to Python 3 it gives me this error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 123: illegal multibyte sequence

The test.json file looks like this:

[{
    "name": "Daybreakers",
    "detail_url": "http://www.movieinsider.com/m4120/daybreakers/",
    "movie_tt_id": "中文"
  }]

If I delete the Chinese characters, there is no error.

So what should I do?

There are a lot of similar questions on SO, but I didn't find a good solution for my case. If you find an applicable one, please tell me and I'll close this one.

Thanks a lot!

cqcn1991 asked Dec 06 '16 14:12


1 Answer

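You need to specify the correct encoding when you open the file. In Python 3, open() in text mode decodes with the locale's preferred encoding when none is given, and the error message suggests that on your system this is GBK. Python 2 read the file as bytes, and json.load decoded them as UTF-8 by default, which is why the same code used to work. If the JSON is encoded with UTF-8 you can do this: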

import json

fname = "test.json" 
with open(fname, encoding='utf-8') as data_file:    
    data = json.load(data_file)

print(data)

Output:

[{'name': 'Daybreakers', 'detail_url': 'http://www.movieinsider.com/m4120/daybreakers/', 'movie_tt_id': '中文'}]
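
To see which default encoding open() would pick on a given machine, and to compare two ways of avoiding that default, here is a minimal sketch. It assumes the file really is UTF-8; the temporary path and the trimmed sample data are made up for illustration and are not the asker's actual file. The binary-mode variant relies on json.load accepting bytes input, which requires Python 3.6 or later.

```python
import json
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given.
# On a Chinese-locale Windows machine this is typically 'cp936' (GBK).
print(locale.getpreferredencoding(False))

# Hypothetical sample data, written as UTF-8 to a temporary file.
sample = [{"name": "Daybreakers", "movie_tt_id": "中文"}]
path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Option 1: name the encoding explicitly (the approach in this answer).
with open(path, encoding="utf-8") as f:
    data_text = json.load(f)

# Option 2: open in binary mode so the locale default is never used;
# json.load then detects UTF-8, UTF-16 or UTF-32 itself (Python 3.6+).
with open(path, "rb") as f:
    data_bytes = json.load(f)

print(data_text == data_bytes == sample)
```

Option 1 is the more explicit of the two and fails loudly if the file is not UTF-8. Option 2 may be handy when the same code has to run on machines with different locale settings.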
PM 2Ring answered Sep 21 '22 17:09