How to read the file contents from a file?

Question

Using Python 3, I want to os.walk a directory of files, read each one into a binary object (a string?), and do some further processing on it. First step, though: how do I read the files that os.walk returns?

# NOTE: Execute with python3.2.2

import os
import sys

path = "/home/user/my-files"

count = 0
successcount = 0
errorcount = 0
i = 0

#for directory in dirs
for (root, dirs, files) in os.walk(path):
 # print (path)
 print (dirs)
 #print (files)

 for file in files:

   base, ext = os.path.splitext(file)
   fullpath = os.path.join(root, file)

   # Read the file into binary? --------
   input = open(fullpath, "r")
   content = input.read()
   length = len(content)
   count += 1
   print ("    file: ---->",base," / ",ext," [count:",count,"]",  "[length:",length,"]")
   print ("fullpath: ---->",fullpath)

ERROR:

Traceback (most recent call last):
  File "myFileReader.py", line 41, in <module>
    content = input.read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

Lennart Regebro · Accepted Answer

To read a binary file you must open the file in binary mode. Change

input = open(fullpath, "r")

to

input = open(fullpath, "rb")

read() will then return a bytes object instead of a str.
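
A minimal sketch of that change applied to the walk loop from the question, using a with block so each file is closed after it is read. The function name read_all_bytes is made up for illustration and is not part of the original code.

```python
import os

def read_all_bytes(top):
    """Walk `top` and return a dict mapping each full path to its raw bytes."""
    contents = {}
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            # "rb" skips text decoding entirely, so read() returns bytes
            with open(fullpath, "rb") as handle:
                contents[fullpath] = handle.read()
    return contents
```

Calling read_all_bytes("/home/user/my-files") would then give one bytes value per file, and len() on each value is the file size in bytes.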

Cory Dolphin · Answer

Because some of your files are binary, they cannot be decoded into the Unicode characters that Python 3 uses for all text strings. A major change between Python 2 and Python 3 is that str went from being a byte string to being a sequence of Unicode characters, so a character can no longer be treated as a single byte. (Text strings can take more memory as a result: Python 3.2 stores each character in 2 or 4 bytes depending on how the interpreter was built, and Python 3.3 and later use 1, 2, or 4 bytes per character depending on the string's contents.)

You thus have a number of options that will depend upon your project:

Ignore binary files, filtering them out by file extension.
Read every file and either catch the decoding exception when it occurs and skip that file, or use one of the methods described in the thread "How can I detect if a file is binary (non-text) in python?"
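
The first option could look roughly like this. The set of extensions and both function names are assumptions chosen for illustration; adjust the set to the file types you actually have.

```python
import os

# Hypothetical list of extensions to treat as binary; not exhaustive.
BINARY_EXTENSIONS = {".png", ".jpg", ".gif", ".pdf", ".zip", ".exe"}

def looks_binary_by_extension(filename):
    """Return True when the extension is in BINARY_EXTENSIONS (case-insensitive)."""
    base, ext = os.path.splitext(filename)
    return ext.lower() in BINARY_EXTENSIONS

def text_file_paths(top):
    """Yield full paths under `top` whose extension is not in the set."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if not looks_binary_by_extension(name):
                yield os.path.join(root, name)
```

This only looks at names, so a binary file with an unlisted or missing extension still gets through.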

In this vein, you could edit your solution to simply catch the UnicodeDecodeError and skip the file.
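
A sketch of the catch-and-skip approach, assuming the text files are UTF-8. The function name read_text_files and its return shape are invented for this example.

```python
import os

def read_text_files(top, encoding="utf-8"):
    """Return ({path: text}, [skipped paths]) for every file under `top`."""
    texts = {}
    skipped = []
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            try:
                with open(fullpath, "r", encoding=encoding) as handle:
                    texts[fullpath] = handle.read()
            except UnicodeDecodeError:
                # Not valid text in this encoding; treat it as binary and skip
                skipped.append(fullpath)
    return texts, skipped
```

A file that happens to decode without error is still kept, even if it is not really text, so this is a filter of convenience rather than a reliable binary detector.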

Whichever option you choose, note that if the files on your system use a range of different character encodings, you will need to specify the encoding explicitly. Otherwise open() uses the platform's default encoding, which was UTF-8 in your case, as the traceback shows.
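
For example, open() accepts encoding and errors arguments. The helper below is a hypothetical wrapper, and "latin-1" in the usage note is only a stand-in for whatever encoding your files really use.

```python
def read_text(path, encoding="utf-8", errors="strict"):
    """Read `path` as text using an explicit encoding and error policy."""
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()
```

read_text(path, encoding="latin-1") decodes a Latin-1 file correctly, while read_text(path, errors="replace") keeps going on undecodable bytes and substitutes U+FFFD for them instead of raising.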

As a reference, a great presentation on Python 3 I/O: http://www.dabeaz.com/python3io/MasteringIO.pdf

Tags:

python

python-3.x

os.walk

DrLou
