
How can I open UTF-16 files on Python 2.x?

Tags: python, unicode

I'm working on a Python tool that must be able to open files encoded as either UTF-8 or UTF-16. In Python 3.2, I use the following code to try opening the file as UTF-8, then retry with UTF-16 if there's a Unicode error:

def readGridFromPath(self, filepath):
    try:
        self.readGridFromFile(open(filepath, 'r', encoding='utf-8'))
    except UnicodeDecodeError:
        self.readGridFromFile(open(filepath, 'r', encoding='utf-16'))

(readGridFromFile will either run through to completion or raise a UnicodeDecodeError.)

However, when I run this code in Python 2.x, I get:

TypeError: 'encoding' is an invalid keyword argument for this function

I see in the docs that Python 2.x's open() doesn't have an encoding keyword. Is there any way around this that will allow me to make my code Python 2.x compatible?

asked Apr 07 '12 21:04 by stalepretzel



1 Answer

io.open (available in Python 2.6 and later) is a drop-in replacement for your needs, so the code sample you provided looks as follows in Python 2.x:

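import io

def readGridFromPath(self, filepath):
    # Try UTF-8 first; fall back to UTF-16 if decoding fails.
    # The with blocks close each file even when an error is raised.
    try:
        with io.open(filepath, 'r', encoding='utf-8') as f:
            self.readGridFromFile(f)
    except UnicodeDecodeError:
        with io.open(filepath, 'r', encoding='utf-16') as f:
            self.readGridFromFile(f)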
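For anyone who wants to try the fallback outside the class, here is a minimal standalone sketch. The helper name read_text_with_fallback and the demo file are hypothetical and only for illustration; the sketch assumes the UTF-16 file starts with a BOM (which the 'utf-16' codec writes), since that is what makes it invalid as UTF-8.

```python
import io
import os
import tempfile

def read_text_with_fallback(path, encodings=('utf-8', 'utf-16')):
    """Return the decoded text of `path`, trying each encoding in order.

    Hypothetical helper for illustration; not part of the original code.
    """
    last_error = None
    for encoding in encodings:
        try:
            with io.open(path, 'r', encoding=encoding) as f:
                # Decoding errors surface on read(), not on open().
                return f.read()
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error

# Demo: a UTF-16 file written with a BOM is not valid UTF-8,
# so the first attempt fails and the second one succeeds.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-16') as f:
    f.write(u'grid \u00e9')
demo_text = read_text_with_fallback(demo_path)
os.remove(demo_path)
```

The same source should run unchanged on Python 2.6+ and Python 3, because io.open returns unicode text on both.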


io.open is described in detail in the Python io module documentation. Its signature is:

io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
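As a rough illustration of two of these parameters, the sketch below uses errors and newline on a small temporary file. The file contents and variable names are made up for the example.

```python
import io
import os
import tempfile

# Illustrative data: Latin-1 bytes (not valid UTF-8) with CRLF line endings.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'caf\xe9\r\nbar\r\n')

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
with io.open(sample_path, 'r', encoding='utf-8', errors='replace') as f:
    replaced = f.read()

# newline='' disables newline translation, so '\r\n' is returned unchanged.
with io.open(sample_path, 'r', encoding='latin-1', newline='') as f:
    untranslated = f.read()

os.remove(sample_path)
```

With the default errors=None (strict), the first read would raise UnicodeDecodeError instead, which is the behavior the UTF-8/UTF-16 fallback relies on.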

The io module itself was designed as a compatibility layer between Python 2.x and Python 3.x, to ease the transition to Py3k and to simplify back-porting and maintenance of existing Python 2.x code.

Also, please note one caveat if you use codecs.open instead: it works in binary mode only. From the documentation:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.
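A small sketch of what that note means in practice, comparing the two functions on a file with Windows-style line endings. The temporary file and variable names are illustrative only; on recent Python 3 releases codecs.open may also emit a deprecation warning.

```python
import codecs
import io
import os
import tempfile

# Illustrative data: two lines with Windows-style '\r\n' endings.
fd, crlf_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\r\nb\r\n')

with io.open(crlf_path, 'r', encoding='utf-8') as f:
    via_io = f.read()        # text mode: '\r\n' is translated to '\n'

with codecs.open(crlf_path, 'r', encoding='utf-8') as f:
    via_codecs = f.read()    # binary mode underneath: '\r\n' is kept

os.remove(crlf_path)
```

Code that splits on '\n' or compares whole lines will therefore see a trailing '\r' with codecs.open that it would not see with io.open.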

You may also have to detect and strip a UTF-8 BOM manually: codecs.open with the 'utf-8' codec leaves the BOM in the text as a leading u'\ufeff' character (the 'utf-8-sig' codec strips it).
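The BOM behavior can be checked with a short sketch like the one below. The helper read_text and the sample file are hypothetical. The 'utf-8-sig' codec used in the second read is a standard-library codec that drops a leading BOM, so it may be a simpler alternative to stripping it by hand.

```python
import codecs
import os
import tempfile

def read_text(path, encoding):
    """Read a whole file with codecs.open (hypothetical helper)."""
    with codecs.open(path, 'r', encoding=encoding) as f:
        return f.read()

# Illustrative data: a small UTF-8 file that starts with a BOM.
fd, bom_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'hello')

with_bom = read_text(bom_path, 'utf-8')         # BOM kept as u'\ufeff'
without_bom = read_text(bom_path, 'utf-8-sig')  # BOM stripped by the codec
os.remove(bom_path)
```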

answered Sep 20 '22 01:09 by toriningen