
Python open("x", "r") function, how do I know or control which encoding the file is supposed to have?

If a python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?

Note that since I'm executing this script from my own program, if there is any way to control this through environment variables, then that is good enough for me.

This is Python 2.7 by the way.

The code in question comes from Mercurial, which can be given a list of files to, say, add to the repository through a file on disk, instead of having them passed on the command line.

So basically, instead of this:

hg add A B C

I can write out A, B and C to a file, with newlines between each, and then execute the following:

hg add listfile:input.txt

The code that ends up reading this file is this:

files = open(name, 'r').read().split(delimiter)

Hence my question. The answer I was given on IRC when I asked which encoding I should use was this:

it is the same encoding than the one you use on command line when passing a file argument

I take this to mean that it is the same encoding I "use" when I execute Mercurial (hg). I have no idea which encoding that is, since I just hand everything to the .NET Process object, so I'm asking here.

Lasse V. Karlsen asked May 01 '11 16:05




1 Answer

You can't. Reading a file is independent of its encoding: in Python 2.7, open(name, 'r') hands back the file's bytes as a str and does no decoding at all. You need to know the encoding in advance in order to properly interpret the bytes you read in.

For example, if you know the file is encoded in UTF-8:

with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig')    # -sig deals with BOM, if present
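
To make the BOM comment concrete, here is a minimal sketch that writes a throwaway file starting with a UTF-8 byte order mark and decodes it both ways. The temporary file and the variable names (raw, plain, stripped) are just for illustration; it should run on Python 2.7 and 3.

```python
import codecs
import io
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark (BOM).
text = u'caf\xe9'
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + text.encode('utf-8'))

# Binary mode returns the raw bytes, with no decoding applied.
with io.open(path, 'rb') as f:
    raw = f.read()
os.remove(path)

plain = raw.decode('utf-8')         # BOM survives as a leading U+FEFF
stripped = raw.decode('utf-8-sig')  # BOM is dropped if present
```

With plain 'utf-8' the decoded text begins with U+FEFF, which would end up as part of the first filename in a list file; 'utf-8-sig' avoids that.
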

Or if you know the file is ASCII only:

with open('filename', 'r') as f:
    contents = f.read()    # results in a str object

If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
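
As a rough illustration of what "guessing" means, here is a stdlib-only sketch that simply tries a few candidate encodings in order. This is plain trial decoding, not what chardet does (chardet uses statistical detection), and guess_encoding and its candidate list are hypothetical names made up for this example.

```python
def guess_encoding(data, candidates=('ascii', 'utf-8')):
    """Return the first candidate encoding that decodes data, or None.

    A successful decode only means the bytes are valid in that encoding,
    not that the file was actually written with it.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


ascii_guess = guess_encoding(b'hello')                 # 'ascii'
utf8_guess = guess_encoding(u'caf\xe9'.encode('utf-8'))  # 'utf-8'
no_guess = guess_encoding(b'\xff\xfe')                 # None
```

Order matters here: put the strictest encodings first, because permissive ones such as latin-1 accept any byte sequence and would always "win".
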

UPDATE:

I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)

The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but unlikely I think). So you'll want to make a text file that contains only ASCII (codepoint < 128) characters, and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate considering that Mercurial deals with filenames, which can contain Unicode characters.
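
A minimal sketch of producing such a file, assuming the delimiter is a newline as described in the question. write_listfile is a hypothetical helper, not part of Mercurial; it refuses non-ASCII names up front instead of writing bytes the reader might misinterpret. The read-back at the end mimics the open(...).read().split(delimiter) line from the question, using binary mode so it behaves the same on Python 2.7 and 3.

```python
import io
import os
import tempfile


def write_listfile(path, names, delimiter=u'\n'):
    """Write names to path as pure ASCII bytes, separated by delimiter.

    Raises UnicodeEncodeError if any name contains a non-ASCII character;
    the encode happens before the file is opened, so nothing is written
    in that case.
    """
    data = delimiter.join(names).encode('ascii')
    with io.open(path, 'wb') as f:
        f.write(data)


fd, path = tempfile.mkstemp()
os.close(fd)
write_listfile(path, [u'A', u'B', u'C'])

# Read it back the way the quoted Mercurial line does: bytes in, bytes out.
with io.open(path, 'rb') as f:
    files = f.read().split(b'\n')
os.remove(path)
```

The resulting file could then be passed as in the question, e.g. hg add listfile:input.txt, with input.txt being whatever path you wrote to.
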

Cameron answered Sep 29 '22 07:09