
Handle surrogates with pandas

When saving the data with

data.to_csv(outp_file, encoding='utf-8')

I sometimes get errors like this:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 233-234: surrogates not allowed

In Python 3 you can simply replace such characters in a single string:

>>> "abc\udc34xyz".encode('utf-8', 'replace').decode('utf-8')
'abc?xyz'

But here I have a DataFrame with N rows and M columns. Removing the rows that contain surrogates is fine, but skipping the whole DataFrame is not.

The problem is that I don't know which rows and columns they are in.

I am looking for a solution that could be applied in the following way:

try:
    data.to_csv(outp_file, encoding='utf-8')
except UnicodeEncodeError:
    # process data and save it without surrogates...

Any help?

tarashypka asked Oct 31 '17 13:10


1 Answer

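import pandas as pd

for col in data.columns:
    if data[col].dtype == object:
        # Leave missing values alone; replace unencodable characters (surrogates) with '?'
        data[col] = data[col].apply(lambda x: x if pd.isna(x) else str(x).encode('utf-8', 'replace').decode('utf-8'))

Try this; it worked for me. It replaces every surrogate in the object (string) columns with '?' and leaves missing values as they are, so no rows need to be dropped.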
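If dropping the affected rows is preferred over replacing characters, as the question allows, something along these lines might work. This is only a sketch: the names `has_surrogates`, `drop_surrogate_rows` and `save_csv` are hypothetical helpers, not pandas functions, and it assumes the surrogates sit in cell values rather than in the index or the column names.

```python
import re

import pandas as pd

# Lone surrogate code points (U+D800..U+DFFF) cannot be encoded as UTF-8.
SURROGATE_RE = re.compile(r"[\ud800-\udfff]")


def has_surrogates(value):
    """Return True if value is a str containing at least one surrogate."""
    return isinstance(value, str) and SURROGATE_RE.search(value) is not None


def drop_surrogate_rows(df):
    """Return df without the rows that hold a surrogate in any cell."""
    bad = pd.Series(False, index=df.index)
    # Check every column; non-string cells simply evaluate to False.
    for _, column in df.items():
        bad |= column.map(has_surrogates).astype(bool)
    return df[~bad]


def save_csv(df, path):
    """Hypothetical helper: plain save first, drop bad rows only on failure."""
    try:
        df.to_csv(path, encoding="utf-8")
    except UnicodeEncodeError:
        # The file is reopened in write mode, so a partial first attempt is overwritten.
        drop_surrogate_rows(df).to_csv(path, encoding="utf-8")
```

With the question's variables this would be called as save_csv(data, outp_file). If replacing the characters is acceptable, recent pandas versions (1.1 and later, as far as I know) also accept an errors argument in to_csv, e.g. data.to_csv(outp_file, encoding='utf-8', errors='replace'), which avoids touching the DataFrame at all.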

rahul yadav answered Sep 21 '22 12:09