How to convert \xXY encoded characters to UTF-8 in Python?

Question

I have a text which contains characters such as "\xaf", "\xbe", which, as I understand it from this question, are ASCII encoded characters.

I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library?

Sample 200 characters here.

tzot · Accepted Answer

Your file is already a UTF-8 encoded file.
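
As a minimal sketch of what that means: "\xaf"-style escapes are how Python displays non-ASCII bytes, so the data needs decoding from UTF-8, not encoding to it. The sample bytes below are hypothetical stand-ins, not the question's actual data. In Python 2, calling encode("utf-8") on a byte string first decodes it implicitly as ASCII, which is likely where the UnicodeDecodeError comes from.

```python
# Hypothetical sample bytes (an assumption, not the question's data)
raw = b"caf\xc3\xa9 \xc2\xaf"

# Decoding interprets the bytes as UTF-8 and yields text
text = raw.decode("utf-8")

# Python 2's str.encode() performs this ASCII decode implicitly; it fails
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Re-encoding the decoded text reproduces the original bytes
round_trip = text.encode("utf-8")
```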

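# saved encoding-sample to /tmp/encoding-sample
import codecs
import sys
import unicodedata as ud

# decode the file as UTF-8 while reading
with codecs.open("/tmp/encoding-sample", "r", "utf8") as fp:
    data = fp.read()

# list each distinct character with its code point and Unicode name
chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        # control characters have no name in the Unicode database
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))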

And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
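
The manual step of filling in names could be sketched as follows. The CONTROL_NAMES table and the describe helper are hypothetical names introduced for illustration, and the table covers only the three control characters listed above. The sketch relies on unicodedata.name accepting a default value instead of raising ValueError.

```python
import unicodedata as ud

# Hypothetical lookup table for control characters that ud.name() cannot name
CONTROL_NAMES = {
    0x0A: "LINE FEED",
    0x1E: "INFORMATION SEPARATOR TWO",
    0x1F: "INFORMATION SEPARATOR ONE",
}

def describe(char):
    # Prefer the Unicode database name, then the table, then a placeholder
    name = ud.name(char, None) or CONTROL_NAMES.get(ord(char), "<unknown>")
    return "char U%04x %s" % (ord(char), name)
```

Calling describe on each element of sorted(set(data)) would give the same listing format as the loop above.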

Tags: python, character-encoding, unicode, utf-8, non-ascii-characters

Jindřich Mynarz
