
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

When I try to use:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_csv('sentiment_data.csv')

I get the error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

I don't understand what this error means, or how to fix it so that I can export my data to CSV or Excel. I have read this question, but I don't understand much of it, and it doesn't explain how to do this with pandas.

What does position 388 mean? What is the character '\ud83d'?

I get a different error position when I try to export to Excel:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_excel('sentiment_data_new.xlsx')

Error while exporting to excel:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 261: surrogates not allowed

Why is the position different when it's the same encoding?

The other duplicate questions don't explain how to avoid this error with a pandas DataFrame.

Mohit Motwani asked Feb 05 '19 14:02


1 Answer

Most emoji lie outside Unicode's Basic Multilingual Plane (BMP), which means their codepoints won't fit in 16 bits. Surrogate pairs are a way to make these characters directly representable in UTF-16 as a pair of 16-bit code units.
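To make the pairing concrete, here is a minimal sketch of the standard UTF-16 arithmetic that turns one codepoint above U+FFFF into a high and a low surrogate and back. The function names are made up for illustration; they are not part of Python or pandas.

```python
def to_surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP have surrogate pairs")
    offset = codepoint - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F604)
print(hex(high), hex(low))                  # → 0xd83d 0xde04
print(hex(from_surrogate_pair(high, low)))  # → 0x1f604
```

So the '\ud83d' in the error message is the first half of such a pair, not a character in its own right.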

You can force surrogate pairs to be resolved into the corresponding codepoint outside the BMP like this:

"\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16')

This will give you the codepoint \U0001f604. Note how it takes more than 4 hex digits to express.
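For data that has to be written out, the same round trip can be wrapped in a small helper. This is a sketch, not a pandas feature: the name resolve_surrogates is hypothetical, and replacing an unpaired surrogate with U+FFFD (via the 'replace' error handler on decode) is my assumption about what you would want, since a lone surrogate cannot be resolved into anything.

```python
def resolve_surrogates(text):
    """Combine surrogate pairs into real codepoints.

    Unpaired surrogates cannot be resolved, so they are replaced with
    U+FFFD (an assumption; you may prefer to drop them instead).
    Non-string values such as NaN or numbers are returned unchanged.
    """
    if not isinstance(text, str):
        return text
    return text.encode('utf-16', 'surrogatepass').decode('utf-16', 'replace')


cleaned = resolve_surrogates("nice \ud83d\ude04")
print(cleaned == "nice \U0001f604")   # → True
cleaned.encode('utf-8')               # no longer raises UnicodeEncodeError
```

With pandas, one way to use it would be to map it over each text column before exporting, for example df['text'] = df['text'].map(resolve_surrogates), where 'text' stands in for whatever your column is called, and then call to_csv as before.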

But this solution may only get you so far.

A lot of software (including pygame, older versions of IDLE, PowerShell, and the Windows command prompt) supports only the BMP, because it doesn't really use UTF-16 but its predecessor UCS-2, which is essentially UTF-16 without support for codepoints outside the BMP.
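If you need to know in advance whether a string will trouble BMP-only software, a quick check is to look for characters above U+FFFF. A small sketch (the helper name is made up for illustration):

```python
def non_bmp_chars(text):
    """Return (index, character) for every character beyond U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


print(non_bmp_chars("ok \U0001f604"))   # one hit, at index 3
print(non_bmp_chars("plain text"))      # → []
```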

When this answer was originally posted, in IDLE 3.7 and before, print ('\U0001f604') would just raise a UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f604' in position 0: Non-BMP character not supported in Tk.

Python 3.8 finally fixed this, and the fix was backported to later releases of Python 3.7, so in IDLE you can now either provide the 17-bit codepoint directly:

print ('\U0001f604')

or transcode the UTF-16 surrogate pair to the same codepoint:

print ("\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16'))

and both will print 😄.

What you still cannot do is print the UTF-16 surrogate pair as is: if you try print ("\ud83d\ude04"), you will get the same \u escapes back.
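Encoding such an unresolved string to UTF-8 is what produces the error in the question. The sketch below (first_encode_error is a hypothetical helper) reproduces it and shows that the reported position is simply the index of the first offending character within the string being encoded. That may be why the CSV and Excel exports report different positions: each writer presumably encodes a differently assembled piece of text, though I have not checked what pandas passes to the encoder in either case.

```python
def first_encode_error(text, encoding='utf-8'):
    """Return (position, reason) for the first unencodable character, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.start is the index into `text` where encoding failed
        return exc.start, exc.reason
    return None


print(first_encode_error("abc\ud83d\ude04"))   # → (3, 'surrogates not allowed')
print(first_encode_error("abc\U0001f604"))     # → None
```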

BoarGules answered Nov 14 '22 22:11