Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python 3: Read UTF-8 file containing German umlaut

I searched and found many similar questions and articles but none would allow me to resolve the issue.

I use Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on Windows 10.

I have a simple text file which is encoded for Windows in UTF-8 like so: enter image description here

All I want to do is to read the content of this file into a Python string and display it correctly in, say, the standard console.

Here is a first attempt that fails miserably:

    file_name=r'c:\temp\encoding_test.txt'
    fh=open(file_name,'r')
    f_str=fh.read()
    fh.close()
    print(f_str)

The print-statement raises an exception:

'charmap' codec can't encode character '\u201e' in position 100: character maps to undefined

Using a debugger, f_str contains the following:

'I would like the following characters to display correctly after reading this file into Python:\n\nÄÖÜäöüß\n'

This is already very puzzling to me. Doesn't Python 3 use UTF-8 as a default everywhere? What other encoding would work? I tried all of the ones Notepad++ supports, none works.

OK, a bit more sophisticated, I tried:

    import codecs
    file_name=r'c:\temp\encoding_test.txt'
    my_encoding='utf-8'
    fh=codecs.open(file_name,'r',encoding=my_encoding)
    f_str=fh.read().encode(my_encoding)
    fh.close()
    print(f_str)

This does not raise an exception, at least, but yields

b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n' I

This is a complete mess to me. Can anyone here please help me sort this out?

like image 754
user5507059 Avatar asked Dec 02 '22 15:12

user5507059


2 Answers

The accepted answer is too complicated. You just need to specify encoding for open:

fh = open(file_name, encoding='utf8')

And everything works without problems.

The answer to your other question:

Doesn't Python 3 use UTF-8 as a default everywhere?

Is "not when communicating with the external world (filesystem in this case), because it would be inconsistent with your OS". Specification says user's preferred encoding is locale dependent. Do

>>> import locale
>>> locale.getpreferredencoding()

to see what it is on your system - most likely "cp something" on Windows, depending on the exact default codepage set. But you can always override with explicit encoding argument to open.

There, I hope you learned something new. :-)

like image 200
Veky Avatar answered Dec 11 '22 12:12

Veky


You are encoding to bytes after using codecs.open , just printing the data should give you want as you can see when we decode back:

In [31]: s = b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

In [32]: print(s)
b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

In [33]: print(s.decode("utf-8"))
I would like the following characters to display correctly after reading this file into Python:

ÄÖÜäöüß

If you are not seeing the correct output then it is your shell encoding that is the problem. The windows console encoding is not utf-8 so where you are running the code from and the shell encoding matters.

like image 34
Padraic Cunningham Avatar answered Dec 11 '22 10:12

Padraic Cunningham