Handle surrogates with pandas

Question

When saving the data

data.to_csv(outp_file, encoding='utf-8')

I sometimes get errors like this:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 233-234: surrogates not allowed

In Python 3 you can simply replace such characters:

>>> "abc\udc34xyz".encode('utf-8', 'replace').decode('utf-8')
'abc?xyz'

But here I have a dataframe with N rows and M columns. Removing the rows that contain surrogates is fine with me, but skipping the whole dataframe is not.

The problem is that I don't know which rows and columns they are in.

I am looking for a solution that could be applied in the following way:

try:
   data.to_csv(outp_file, encoding='utf-8')
except UnicodeEncodeError:
   # process data and save it without surrogates...

Any help?

rahul yadav · Accepted Answer

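for col in train.columns:
    if train[col].dtype == object:
        # Only touch real strings. The original test x == np.nan is always False,
        # so missing values were being turned into the string 'nan'.
        train[col] = train[col].apply(lambda x: x.encode('utf-8', 'replace').decode('utf-8') if isinstance(x, str) else x)

Try this. It worked for me. Here train is the dataframe (data in the question), and each surrogate is replaced with '?', as in the one-line example from the question.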
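
Since the question says dropping the offending rows is acceptable, here is a minimal sketch of that variant, wrapped in the try/except shape the question asks for. The helper names has_surrogate, drop_surrogate_rows and save_csv are made up for illustration and are not part of pandas. It assumes the surrogates sit in the cell values rather than in the index or the column names.

```python
import pandas as pd


def has_surrogate(value):
    """Return True if value is a str that UTF-8 cannot encode."""
    if not isinstance(value, str):
        return False
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def drop_surrogate_rows(df):
    """Return a copy of df without rows that hold a surrogate in any cell."""
    # Boolean frame of the same shape as df, then collapse it to one flag per row.
    bad = df.apply(lambda col: col.map(has_surrogate)).any(axis=1)
    return df.loc[~bad].copy()


def save_csv(df, path):
    """Try a plain save first; on failure, retry without the offending rows."""
    try:
        df.to_csv(path, encoding="utf-8")
    except UnicodeEncodeError:
        drop_surrogate_rows(df).to_csv(path, encoding="utf-8")
```

Usage would be save_csv(data, outp_file). The cell-by-cell scan only runs when the first save fails, so clean dataframes pay nothing extra. If replacing is good enough and you do not need to drop rows, pandas 1.1 and later should also accept an errors argument directly, as in data.to_csv(outp_file, encoding='utf-8', errors='replace').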

Tags:

python-3.x

pandas

tarashypka
