
Reading UTF-8 Encoded Files and Text Files in Python3

OK, so Python 3 and Unicode. I know that all Python 3 strings are Unicode strings and that Python 3 source code is read as UTF-8 by default. But how does Python 3 read text files? Does it assume that they are encoded in UTF-8? Do I need to call decode('utf-8') when reading a text file? What about pandas read_csv() and to_csv()?

Bella Dubrov asked Dec 22 '17 23:12




1 Answer

Python's built-in function open() has an optional parameter encoding:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
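As a minimal sketch of what this means in practice (the file name and sample text below are made up for illustration): when a file is opened in text mode, read() already returns a decoded str, so no separate decode('utf-8') call is needed. Passing encoding explicitly avoids depending on the platform default. Only binary mode returns bytes that you would decode yourself.

```python
import locale
import os
import tempfile

text = "café"  # sample text containing a non-ASCII character

# Hypothetical file in a temporary directory, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Text mode with an explicit encoding: str in, encoded bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Text mode again: open() decodes for us, so read() returns str.
with open(path, encoding="utf-8") as f:
    decoded = f.read()

# Binary mode: no decoding happens, so read() returns bytes.
with open(path, "rb") as f:
    raw = f.read()

# The encoding open() falls back to when none is given, per the quoted docs.
default_encoding = locale.getpreferredencoding(False)

print(type(decoded).__name__, type(raw).__name__, default_encoding)
```

If the file is not actually encoded the way encoding claims, read() will typically raise UnicodeDecodeError, which is usually preferable to silently getting wrong characters.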

An analogous parameter exists in pandas:

  • pandas.read_csv(): encoding: str, default None. Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
  • Series.to_csv(): encoding: string, optional. A string representing the encoding to use if the contents are non-ascii, for python versions prior to 3.
  • DataFrame.to_csv(): encoding: string, optional. A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
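A short round trip might look like the following sketch. It assumes pandas is installed; the column names, values, and file name are hypothetical. Passing encoding to both calls makes the choice explicit instead of relying on defaults.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with non-ASCII characters.
df = pd.DataFrame({"city": ["Zürich", "São Paulo"], "visits": [3, 7]})

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "cities.csv")

# Write the CSV as UTF-8, then read it back with the same encoding.
df.to_csv(path, index=False, encoding="utf-8")
round_trip = pd.read_csv(path, encoding="utf-8")

print(round_trip)
```

The values come back as ordinary Python strings, so no decode() call is involved here either.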
JosefZ answered Dec 20 '22 19:12
