Open a file in the proper encoding automatically [duplicate]

Tags:

python

I'm having encoding problems with a few files. We receive files from another company and have to read them (the files are in CSV format).

Strangely, the files appear to be encoded in UTF-16. I can read them, but only by opening them with the codecs module and specifying the encoding explicitly, like this:

import codecs
import csv

ENCODING = 'utf-16'
with codecs.open(test_file, encoding=ENCODING) as csv_file:
    # Autodetect the dialect from the first 1024 characters
    dialect = csv.Sniffer().sniff(csv_file.read(1024))
    csv_file.seek(0)  # rewind so the reader starts at the first row
    input_file = csv.reader(csv_file, dialect=dialect)

    for line in input_file:
        do_funny_things()

But just as I can detect the dialect in a more agnostic way, I think it would be great to have a way of automatically opening files with their proper encoding, at least for text files. Other programs, such as vim, manage to do that.

Does anyone know a way of doing that in Python 2.6?

PS: I hope this will be solved in Python 3, since all strings are Unicode...

642

asked Feb 26 '10 14:02

Khelben

2 Answers

chardet can help you.

Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source.
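As a rough sketch of how this might look (written for Python 3, not the 2.6 from the question): `chardet.detect` takes raw bytes and returns a dict with an `encoding` guess and a `confidence` score. The helper names `guess_encoding` and `read_text` and the `default` parameter are made up for illustration and are not part of chardet. Falling back to a default when chardet is not installed is also just an assumption of this sketch.

```python
def guess_encoding(raw_bytes, default="utf-8"):
    """Return chardet's best guess for raw_bytes, or `default` if there is none.

    Hypothetical helper, not part of chardet itself.
    """
    try:
        import chardet  # third-party package: pip install chardet
    except ImportError:
        return default  # assumption: fall back quietly when chardet is absent
    result = chardet.detect(raw_bytes)  # dict with 'encoding' and 'confidence'
    return result["encoding"] or default  # 'encoding' may be None


def read_text(path):
    """Read a whole file as text, guessing the encoding from its bytes."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return raw.decode(guess_encoding(raw))
```

The guess can still be wrong, so it may be worth checking `confidence` or catching `UnicodeDecodeError` around the final decode.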

129

answered Oct 31 '22 23:10

Desintegr

It won't be "fixed" in Python 3, as it's not a fixable problem. Many documents are valid in several encodings, so the only way to determine the proper encoding is to know something about the document. Fortunately, in most cases we do know something about the document: for instance, most characters will come clustered into distinct Unicode blocks. A document in English will mostly contain characters within the first 128 code points. A document in Russian will contain mostly Cyrillic code points. Most documents will contain spaces and newlines. These clues can be used to make educated guesses about which encoding is being used. Better yet, use a library written by someone who has already done the work (like chardet, mentioned in another answer by Desintegr).
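To illustrate the ambiguity, and one cheap clue that needs no library, here is a minimal Python 3 sketch. It first shows a single byte that is valid in two encodings but means different things, then checks for a byte order mark (which UTF-16 files like the ones in the question often carry), and finally tries a short list of candidates. The names `sniff_bom` and `decode_with_guess` and the candidate list are assumptions chosen for the example, not a standard API.

```python
import codecs

# The same byte is valid text in many single-byte encodings, just different text.
ambiguous = b"\xe9"
as_latin1 = ambiguous.decode("latin-1")  # Latin small letter e with acute
as_cp1251 = ambiguous.decode("cp1251")   # Cyrillic small letter short i

# Byte order marks. UTF-32 goes first because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw_bytes):
    """Return a codec name if raw_bytes starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw_bytes.startswith(bom):
            return name
    return None


def decode_with_guess(raw_bytes, candidates=("utf-8", "cp1252")):
    """Return (text, encoding), using the BOM if present, else trying candidates.

    The candidate list is an assumption; pick one that matches your data.
    """
    encoding = sniff_bom(raw_bytes)
    if encoding is not None:
        return raw_bytes.decode(encoding), encoding
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails,
    # although the resulting text may be wrong.
    return raw_bytes.decode("latin-1"), "latin-1"
```

A successful decode only means the bytes are valid in that encoding, not that the encoding is the right one, which is why statistical detectors go further than this.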

answered Oct 31 '22 23:10

jcdyer
