
Read a text file with non-ASCII characters in an unknown encoding

I want to read a file that contains German characters, not only ASCII ones. I found that I can do it like this:

  >>> import codecs
  >>> f = codecs.open('file.txt', 'r', encoding='UTF-8')
  >>> lines = f.readlines()

This works when I run my script in Python IDLE, but when I run it from somewhere else it does not give the correct result. Any ideas?

indiag asked Jun 18 '12 16:06



1 Answer

You need to know which character encoding the text is encoded in. If you don't know that beforehand, you can try guessing it with the chardet module. First install it:

$ pip install chardet

Then pass it the file's raw bytes, for example by reading the file in binary mode:

>>> import chardet
>>> chardet.detect(open("file.txt", "rb").read())
{'confidence': 0.9690625, 'encoding': 'utf-8'}

Then open the file with the detected encoding:

>>> import codecs
>>> lines = codecs.open('file.txt', 'r', encoding='utf-8').readlines()
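
The steps above can be combined into one helper. This is only a rough sketch, assuming Python 3 (where the built-in open() accepts an encoding argument). The names guess_encoding and read_lines and the cp1252 default are hypothetical choices for illustration, not part of chardet. It tries strict UTF-8 first, since text in other encodings rarely passes as valid UTF-8, and only then asks chardet, if it is installed:

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, minus the guessing step


def guess_encoding(raw, default="cp1252"):
    """Return a best-guess encoding name for the bytes in `raw`."""
    # Strict UTF-8 first: if the bytes decode cleanly, assume UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise ask chardet, if available; its guess may be None.
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            try:
                raw.decode(guess)
                return guess
            except (UnicodeDecodeError, LookupError):
                pass
    # Assumed default for this sketch; pick one that suits your data.
    return default


def read_lines(path):
    """Read `path` as a list of text lines using the guessed encoding."""
    with open(path, "rb") as f:
        encoding = guess_encoding(f.read())
    # errors="replace" keeps a wrong guess from raising mid-file.
    with open(path, encoding=encoding, errors="replace") as f:
        return f.readlines()
```

Calling read_lines('file.txt') then returns the decoded lines. Keep in mind that any detection is a guess: if you can find out the real encoding from whoever produced the file, pass it to open() directly instead.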

Chewie answered Oct 02 '22 21:10