Remove utf-8 literals in a string python

Question

I'm new to Python. I have a string like:

s= 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

I want to remove all the Unicode literals from the string, such as:

'\xc3\x82\xc2\xae'

I need output like:

'HDCF FTAE Greater China'

Can anyone help me with this?

Thank you

Mark Tolonen · Accepted Answer

On Python 2 (default string type is bytes):

>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> s.decode('ascii',errors='ignore').encode('ascii')
'HDCF FTAE Greater China'

On Python 3 (default string type is Unicode):

>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> s.encode('ascii',errors='ignore').decode('ascii')
'HDCF FTAE Greater China'
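
As a reusable form of the Python 3 one-liner above, here is a minimal sketch. The helper name `strip_non_ascii` is made up for illustration and is not part of any library. Note that it discards every non-ASCII character, so a legitimate ® sign would be dropped as well.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range (hypothetical helper)."""
    # errors='ignore' silently skips characters that ASCII cannot encode.
    return text.encode('ascii', errors='ignore').decode('ascii')


s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
cleaned = strip_non_ascii(s)
print(cleaned)
```

This only fits when losing the non-ASCII characters is acceptable; otherwise repairing the encoding is the better route.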

Note that the original string is mojibake. Ideally, fix how the string was read in the first place, but you can undo the damage with (Python 3):

>>> s.encode('latin1').decode('utf8').encode('latin1').decode('utf8')
'HDCF® FTAE® Greater China'

The original string was double-encoded as UTF-8. The fix works by converting the string 1:1 back to bytes¹, decoding those bytes as UTF-8, then converting the result 1:1 back to bytes again and decoding as UTF-8 a second time.

Here's the Python 2 version:

>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> print s.decode('utf8').encode('latin1').decode('utf8')
HDCF® FTAE® Greater China
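
To make the double round trip described above easier to follow, here is the Python 3 chain split into its four steps. The intermediate variable names are invented for this sketch.

```python
s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

# Step 1: turn each character back into the byte with the same value.
raw_twice = s.encode('latin1')

# Step 2: undo the outer UTF-8 layer; the text is still mojibake ('Â®').
once = raw_twice.decode('utf8')

# Step 3: again map each character back to the byte with the same value.
raw_once = once.encode('latin1')

# Step 4: undo the inner UTF-8 layer; the registered sign is restored.
fixed = raw_once.decode('utf8')

print(repr(once))
print(repr(fixed))
```

After step 2 the string still contains the stray 'Â', which shows that a single decode is not enough for this particular input.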

¹This works because the latin1 codec is a single-byte encoding whose 256 byte values map directly to the first 256 Unicode code points.
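
The 1:1 mapping in the footnote can be checked directly in Python 3. This is only a quick sanity check, with variable names chosen for illustration.

```python
# All 256 possible byte values, in order.
all_bytes = bytes(range(256))

# latin1 decodes byte value n to the code point with the same number n.
decoded = all_bytes.decode('latin1')
same_numbers = [ord(ch) for ch in decoded] == list(range(256))

# Encoding back with latin1 reproduces the original bytes exactly.
round_trip = decoded.encode('latin1') == all_bytes

print(same_numbers, round_trip)
```

Because the round trip is lossless, latin1 can carry arbitrary bytes through a `str` without changing them, which is what the repair relies on.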

Tags: python, string, hex, unicode, utf-8

Narendra Kamatham
