When saving the data
data.to_csv(outp_file, encoding='utf-8')
I sometimes get errors like this
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 233-234: surrogates not allowed
In Python 3 you can simply replace such characters:
>>> "abc\udc34xyz".encode('utf-8', 'replace').decode('utf-8')
'abc?xyz'
But here I have a dataframe with N rows and M columns. It's fine for me to remove rows with surrogates, but it's not fine to skip the whole dataframe.
The problem is I don't know in which rows and in which columns they are.
I am looking for a solution that could be applied in the following way:
try:
    data.to_csv(outp_file, encoding='utf-8')
except UnicodeEncodeError:
    # process data and save it without surrogates...
Any help?
import pandas as pd
for col in train.columns:
    if train[col].dtype == object:
        # x == np.nan is always False, so test for missing values with pd.isna instead
        train[col] = train[col].apply(lambda x: x if pd.isna(x) else str(x).encode('utf-8', 'replace').decode('utf-8'))
Try this. It worked for me. It replaces every character that cannot be encoded as UTF-8 in the object columns with '?'.
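Since the question says dropping the offending rows is acceptable, here is a rough sketch of that alternative. The helper names has_surrogate and drop_surrogate_rows are made up for this example and are not part of pandas; the sketch assumes the surrogates live in ordinary Python str values and simply tries to encode each one.

```python
import pandas as pd


def has_surrogate(value):
    """Return True if value is a str that cannot be encoded as UTF-8."""
    if not isinstance(value, str):
        return False  # numbers, NaN, None etc. are never the problem
    try:
        value.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


def drop_surrogate_rows(df):
    """Return a new DataFrame without the rows that hold an unencodable string."""
    bad = pd.Series(False, index=df.index)
    for col in df.columns:
        # Mark a row as bad if any of its cells fails the encoding check
        bad |= df[col].map(has_surrogate).astype(bool)
    return df[~bad]
```

With these helpers, the except branch from the question could call drop_surrogate_rows(data).to_csv(outp_file, encoding='utf-8'). Inspecting data[col].map(has_surrogate) for a single column shows which rows are affected. If replacing is good enough, pandas 1.1 and later should also accept an errors argument on to_csv (for example errors='replace'), which avoids touching the dataframe at all.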