Python decoding issue with Chinese characters

I'm using Python 3.5, and I'm trying to take a block of bytes that may or may not contain Chinese characters and write it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and always appear in addition to the English spelling of the name. The text is JSON and needs to be decoded before I can load it. The decoding seems to go fine and doesn't give me any errors, but when I try to write the decoded text to a file I get the following error message:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to <undefined>

Here is an example of the raw data that I get before I do anything to it:

 b'  "isBulkRecipient": "false",\r\n      "name": "Name in, English \xef'
 b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n

Here is the code that I am using:

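import csv
import json
from pprint import pprint

# recipientContent holds the raw bytes returned by the API call
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
recipientName = recipientData['signers'][0]['name']
pprint(recipientName)
# Appending the name to the CSV file is the step that fails
with open('envelope recipient list.csv', 'a', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    csvData = [[recipientName]]
    a.writerows(csvData)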

The recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice will be greatly appreciated!

Update:

I've been doing some manual workarounds for each entry that breaks, and came across other entries that didn't contain Chinese characters but had special characters from other languages, and they broke the program as well. The special characters are only in the name field. So a name could be something like "Ałex", where it is a mixture of normal and special characters. Before I decode the string that contains this information, I am able to print it to the screen and it looks like this: b'name": "A\xc5ex",\r\n

But after I decode it as UTF-8, I get an error if I try to output it. The error message is: UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 2: character maps to <undefined>

I looked up what \u0142 was and it is the ł special character.

asked Jun 28 '16 18:06

Alex Hall

1 Answer

The error you're getting is when you're writing to the file.

In Python 3.x, when you open() in text mode (the default) without specifying an encoding=, Python will use an encoding most suitable to your locale or language settings.

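If you're on Windows, this is usually a legacy code page such as cp1252. Python implements those code pages as 'charmap' codecs, which is why the error message names the 'charmap' codec: the character you are writing has no mapping in that code page.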
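As a rough illustration of the point above, the following sketch prints the encoding that open() would fall back to on the current machine and then reproduces the failure by encoding a name with cp1252 directly. The sample name and the choice of cp1252 are assumptions for demonstration; the code page on your machine may differ.

```python
import locale

# The encoding open() uses in text mode when no encoding= is given.
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding:", default_encoding)

# Hypothetical sample name containing U+0142 (LATIN SMALL LETTER L WITH STROKE).
name = "A\u0142ex"

# cp1252 is assumed here as a typical Western European Windows code page.
# It has no mapping for U+0142, so encoding fails in the same way as
# writing the name to a file opened with that encoding.
message = None
try:
    name.encode("cp1252")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)
```

The same name encodes without complaint as UTF-8, which can represent every Unicode character.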

Although you could just write bytes straight to a file, you're doing the right thing by decoding them first. As others have said, you should really decode using the encoding specified by the web server. You could also use the Python Requests module, which does this for you. (Your example doesn't decode as UTF-8, so I assume your example isn't correct.)
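A minimal sketch of decoding with the server-declared encoding, using only the standard library. The response body and the Content-Type header value below are made-up stand-ins for whatever the real API call returns, and falling back to UTF-8 when no charset is declared is an assumption, not something the API is known to guarantee.

```python
import json
from email.message import Message

# Hypothetical stand-ins for the API response body and its Content-Type header.
raw_body = '{"signers": [{"name": "A\u0142ex"}]}'.encode("utf-8")
content_type = "application/json; charset=utf-8"

# Let the standard library parse the charset parameter out of the header.
header = Message()
header["Content-Type"] = content_type
charset = header.get_content_charset() or "utf-8"  # assumed fallback

# Decode strictly (no 'ignore'), so bytes that do not match the declared
# encoding raise an error instead of being dropped silently.
recipient_data = json.loads(raw_body.decode(charset))
recipient_name = recipient_data["signers"][0]["name"]
print(ascii(recipient_name))
```

Decoding with the 'ignore' error handler, as in the question, hides mismatches between the data and the assumed encoding, so strict decoding is usually the safer starting point.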

To solve your immediate error, simply pass open() an encoding that supports the characters in your data. UTF-8 is the obvious choice. Therefore, you should change your code to read:

with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp:
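For context, here is a self-contained sketch of that change applied to the CSV-writing step. The sample name and the temporary directory are assumptions made so the snippet can run anywhere; in the real script the name would come from the decoded API response.

```python
import csv
import os
import tempfile

# Hypothetical sample name mixing English text and Chinese characters.
recipient_name = "Name in, English \u738b\u5c0f\u660e"

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "envelope recipient list.csv")

# An explicit encoding='utf-8' replaces the locale-dependent default.
with open(path, "a", encoding="utf-8", newline="") as fp:
    writer = csv.writer(fp, delimiter=",")
    writer.writerows([[recipient_name]])

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8", newline="") as fp:
    rows = list(csv.reader(fp))
```

Whatever later reads the file has to open it as UTF-8 too, otherwise the non-ASCII characters will come out garbled.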

answered Sep 19 '22 14:09

Alastair McCormack

Tags: python, python-3.x, encoding, python-unicode
