
Python loading 'utf-16' file can't decode '\u0153'

I have a text file encoded as utf-16 which throws an exception for the following character: '\u0153'.

UnicodeEncodeError: 'charmap' codec can't encode character '\u0153' in position

I'm using a very simple script to load the file, and I also tried ignoring errors, to no avail. What am I doing wrong?

with open(filename, "r", encoding="utf-16", errors='replace') as data_file:    
    print(data_file.read())

This is the part of the file that breaks:

["Xinhua","Ãœrümqi"]

EDIT: I'm not sure why my question is being misunderstood. Hopefully this version is clearer.

How should I read this file with Python?

Sample file link (UTF-16-LE file) containing:

["Xinhua","Ürümqi"]

Why doesn't this code work?

with open(filename, "r", encoding="utf-16", errors='replace') as data_file:    
    print(data_file.read())
Asked Mar 03 '15 by Ropstah

1 Answer

The exception that originally stumped you is because you're running Python inside a terminal emulator (or "console window", if that's a more familiar term) that can't display all of the characters in Unicode. To fix that you need to get yourself a Unicode-capable terminal emulator, and then make sure Python knows it's running inside one. If you have no idea how to do that, ask a new question on superuser.com, specifying your operating system.
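
If replacing the terminal emulator is not practical, a minimal workaround sketch (assuming Python 3.7 or newer for reconfigure) is to tell stdout to substitute characters the console encoding cannot represent instead of raising:

import sys

# Python 3.7+: swap stdout's error handler so unencodable characters
# print as '?' rather than raising UnicodeEncodeError.
sys.stdout.reconfigure(errors="replace")

# On older Python 3 versions, a similar effect comes from launching the
# interpreter with the PYTHONIOENCODING environment variable, e.g.
#   PYTHONIOENCODING=utf-8:replace python myscript.py

That only masks the display problem; the file itself still decodes fine, as shown below.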

My terminal emulator can display all of the characters in Unicode, assuming all the necessary fonts are available, and Python knows that, so I can do this and not get an exception:

>>> with open("countryCity2.json", "r", encoding="utf-16") as f:
...   x = f.read()
... 
>>> print(x)
["Xinhua","Ürümqi"]

However, that is not your only problem. Your input file has had its encoding mangled. Ãœrümqi is not a sequence of characters that makes sense in any language. However, it conforms to the characteristic mojibake pattern of text that has been converted from a legacy encoding to UTF-8, and then — incorrectly — converted into a Unicode encoding again. We can test this by converting it 1:1 to bytes and seeing if we get a valid UTF-8 byte sequence:

>>> print(x.encode("iso-8859-1").decode("utf-8"))
["Xinhua","Ürümqi"]

"Ürümqi" is a real word and would plausibly appear in conjunction with "Xinhua". Also, if the text weren't mis-converted UTF-8, we would have seen an exception:

>>> "Ürümqi".encode("iso-8859-1").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0:
  invalid continuation byte

So the hypothesis is confirmed.
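
For the curious, here is one plausible way a file like this comes into existence; this is a reconstruction sketch (the output file name and the choice of Windows-1252 are assumptions), showing correct UTF-8 bytes being decoded with a legacy 8-bit codec and the garbage then written back out as UTF-16:

# Hypothetical reconstruction of the mangling pipeline.
good = '["Xinhua","Ürümqi"]'
utf8_bytes = good.encode("utf-8")        # the text as correct UTF-8 bytes
mangled = utf8_bytes.decode("cp1252")    # wrong codec applied somewhere upstream
# mangled is now '["Xinhua","Ãœrümqi"]' -- note the œ, which is U+0153,
# the very character named in the original UnicodeEncodeError.
with open("countryCity2-mangled.json", "w", encoding="utf-16") as f:
    f.write(mangled)                     # saved as UTF-16, mojibake and all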

In a program that had to deal with a great many files whose encodings might or might not have been mangled in this way, I would do something like this:

for fname in input_files:
    with open(fname, "r", encoding="utf-16") as f:
        contents = f.read()
    try:
        # Attempt to reverse the mangling: reinterpret the characters as
        # raw bytes, then decode those bytes as the UTF-8 they really are.
        contents = contents.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Round trip failed, so this file probably wasn't mangled; keep it as is.
        pass
    process_file(fname, contents)
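
One caveat: the œ (U+0153) named in the question's error message is a Windows-1252 character with no slot in ISO 8859-1, so a string containing it makes the encode("iso-8859-1") step raise and the loop above passes that file through unchanged. A sketch that also tries cp1252, using a hypothetical unmangle helper, could look like this:

def unmangle(text):
    # Try to undo mojibake by round-tripping through the legacy codecs
    # most commonly involved; return the text untouched if neither works.
    for legacy in ("iso-8859-1", "cp1252"):
        try:
            return text.encode(legacy).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text

contents = unmangle(contents)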

I am using the ISO 8859-1 encoding here not because the text is, or ever was, actually in that encoding, but because Python's iso-8859-1 codec is an identity mapping from characters U+0000..U+00FF to bytes 0x00..0xFF. (Technically, that means it implements IANA ISO_8859-1:1987 rather than the original ECMA-94:1985 encoding, which left the 0x00..0x1F and 0x7F..0x9F ranges undefined.) That is,

>>> "".join(chr(c) for c in range(256)).encode('iso-8859-1') == bytes(range(256))
True

Therefore, any time you have binary data that has been mis-converted into Unicode, you can recover the original with .encode('iso-8859-1').
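
As a small demonstration of that recovery property, any bytes mis-decoded as ISO 8859-1 round-trip back exactly:

>>> original = bytes([0xC3, 0x9C, 0x72])      # arbitrary bytes (here, the UTF-8 encoding of "Ür")
>>> as_text = original.decode("iso-8859-1")   # mis-read as ISO 8859-1
>>> as_text.encode("iso-8859-1") == original  # the round trip returns the original bytes
True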

NOTE: All code snippets above are Python 3.

Answered by zwol