"surrogateescape" cannot escape certain characters

On reading and writing text files in Python, one of the main Python contributors says this about the surrogateescape Unicode error handler:

[surrogateescape] handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly.

However, opening a file this way and then attempting to write its lines to another file:

input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')

for line in input_file:
    output_file.write(line)

Results in:

  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed

Note that the input file is not ASCII. However, the loop gets through hundreds of lines that contain non-ASCII characters just fine before it throws the exception on one particular line. The output file must be ASCII, and losing some characters is just fine.

This is the line that throws the error, shown here decoded as UTF-8:

'Zoë\'s Coffee House'

This is the hex encoding:

$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.

dotancohen asked Feb 18 '26 08:02


1 Answer

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII?

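Because that is exactly what it is designed to do. Each byte that fails to decode is mapped to a lone surrogate code point in the range U+DC80 to U+DCFF (byte 0xC3 becomes '\udcc3'), which is not an ASCII character. Encoding with the same error handler maps those surrogates back to the original bytes; the output file in the question is opened without it, so the default UTF-8 codec refuses them.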

>>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
>>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"
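
Applied to the files in the question, this suggests giving the output file an explicit encoding and error handler. What follows is only a minimal sketch, not the asker's actual script: the helper name copy_as_ascii, the RAW constant, and the temporary directory are assumptions made for illustration. It compares three handlers on the output side. Note that errors="surrogateescape" restores the original bytes exactly, so the output is not pure ASCII; errors="replace" or errors="ignore" gives ASCII-only output at the cost of losing the escaped bytes, which the question says is acceptable.

```python
import os
import tempfile

# The exact bytes from the question's hex dump (UTF-8 "ë" is 0xC3 0xAB).
RAW = b"'Zo\xc3\xab\\'s Coffee House'\n"


def copy_as_ascii(src_path, dst_path, out_errors):
    """Copy text line by line, decoding the input with surrogateescape.

    out_errors is the error handler for the *output* file. The code in the
    question left it at the default ("strict"), which is what raised.
    newline="" leaves line endings untouched on every platform.
    """
    with open(src_path, "r", encoding="ascii",
              errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding="ascii",
              errors=out_errors, newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical scratch files, so the sketch does not touch real data.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "someFile.txt")
with open(src_path, "wb") as f:
    f.write(RAW)

results = {}
for handler in ("surrogateescape", "replace", "ignore"):
    dst_path = os.path.join(workdir, "out-" + handler + ".txt")
    copy_as_ascii(src_path, dst_path, handler)
    with open(dst_path, "rb") as f:
        results[handler] = f.read()
    print(handler, results[handler])
```

With "surrogateescape" the copy is byte-for-byte identical to the input; with "replace" each escaped byte becomes "?"; with "ignore" the escaped bytes are dropped.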

Ignacio Vazquez-Abrams answered Feb 20 '26 22:02