Python decoding issue with Chinese characters

I'm using Python 3.5, and I'm trying to take a block of byte text that may or may not contain special Chinese characters and output it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and are always in addition to the English spelling of their name. The text is JSON formatted and needs to be decoded before I can load it. The decoding seems to go fine and doesn't give me any errors. When I try and write the decoded text to a file it gives me the following error message:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to <undefined>

Here is an example of the raw data that I get before I do anything to it:

 b'  "isBulkRecipient": "false",\r\n      "name": "Name in, English \xef'
 b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n

Here is the code that I am using:

import csv
import json
from pprint import pprint

# recipientContent holds the raw bytes returned by the API call
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
recipientName = recipientData['signers'][0]['name']
pprint(recipientName)
with open('envelope recipient list.csv', 'a', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    csvData = [[recipientName]]
    a.writerows(csvData)

The recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice will be greatly appreciated!

Update:

I've been doing some manual workarounds for each entry that breaks, and came across other entries that didn't contain Chinese characters but had special characters from other languages, and they broke the program as well. The special characters are only in the name field. So a name could be something like "Ałex", where it is a mixture of normal and special characters. Before I decode the string that contains this information, I am able to print it to the screen, and it looks like this: b'name": "A\xc5ex",\r\n

But after I decode it as UTF-8, I get an error if I try to output it. The error message is: UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 2: character maps to <undefined>

I looked up what \u0142 was and it is the ł special character.

Alex Hall asked Jun 28 '16 18:06



1 Answer

The error you're getting is when you're writing to the file.

In Python 3.x, when you open() in text mode (the default) without specifying an encoding=, Python will use an encoding most suitable to your locale or language settings.

If you're on Windows, this is usually a legacy code page (cp1252, for example), which Python implements with the charmap codec. That is why the error message mentions 'charmap'.
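To see which encoding applies on a given machine, something like the following sketch may help. It prints the locale's preferred encoding and then reproduces the error by encoding a sample name by hand. The choice of cp1252 is only an assumption about a typical Windows setup, and the name "Ałex" is taken from the question's update.

```python
import locale

# The encoding open() falls back to in text mode when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

name = "A\u0142ex"  # "Ałex", the sample name from the question

# cp1252 is an assumed example of a Windows code page; it has no slot for U+0142.
try:
    name.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)
```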

Although you could just write the bytes straight to a file, you're doing the right thing by decoding them first. As others have said, you should really decode using the encoding specified by the web server. You could also use the Python Requests module, which does this for you. (Your example doesn't decode as UTF-8, so I assume your example isn't correct.)
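If you are not using Requests, one way to honour the server's declared encoding is to read the charset from the Content-Type response header. This is a minimal standard-library sketch. The helper name charset_from_content_type, the header value, and the sample payload are all hypothetical, since the question doesn't show how the API call is made.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

# Hypothetical response body and header, standing in for the real API response.
raw = '{"name": "A\u0142ex"}'.encode("utf-8")
charset = charset_from_content_type("application/json; charset=UTF-8")
text = raw.decode(charset)
print(charset, text)
```

Falling back to UTF-8 when the header names no charset is an assumption that suits JSON, and it may not be right for other content types.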

To solve your immediate error, pass open() an encoding that supports the characters in your data. UTF-8 is the obvious choice, since it can encode every Unicode character. Therefore, you should change your code to read:

with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp:
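For a quick check that this works, here is a small self-contained sketch that writes a name containing ł and reads it back. The temporary directory and the sample name are assumptions made so that the snippet runs on its own. In your program, the name would come from recipientData.

```python
import csv
import os
import tempfile

name = "Name in, English \u0142"  # sample value with a non-ASCII character
path = os.path.join(tempfile.mkdtemp(), "envelope recipient list.csv")

# An explicit encoding= means the locale's default code page is never consulted.
with open(path, "a", encoding="utf-8", newline="") as fp:
    csv.writer(fp, delimiter=",").writerows([[name]])

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8", newline="") as fp:
    rows = list(csv.reader(fp))
print(rows)

# Since the question says the non-English characters are not needed,
# an alternative is to drop everything outside ASCII before writing.
ascii_only = name.encode("ascii", "ignore").decode("ascii")
print(ascii_only)
```

If you open the resulting file in Excel on Windows, encoding='utf-8-sig' may be a better choice, because it writes a byte order mark that Excel uses to detect UTF-8.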
Alastair McCormack answered Sep 19 '22 14:09