
How to read the file contents from a file?

Using Python 3, I want to os.walk a directory of files, read each one into a binary object (a string?), and do some further processing on it. First step, though: how do I read the files that os.walk returns?

# NOTE: Execute with python3.2.2

import os
import sys

path = "/home/user/my-files"

count = 0
successcount = 0
errorcount = 0
i = 0

#for directory in dirs
for (root, dirs, files) in os.walk(path):
 # print (path)
 print (dirs)
 #print (files)

 for file in files:

   base, ext = os.path.splitext(file)
   fullpath = os.path.join(root, file)

   # Read the file into binary? --------
   input = open(fullpath, "r")
   content = input.read()
   length = len(content)
   count += 1
   print ("    file: ---->",base," / ",ext," [count:",count,"]",  "[length:",length,"]")
   print ("fullpath: ---->",fullpath)

ERROR:

Traceback (most recent call last):
  File "myFileReader.py", line 41, in <module>
    content = input.read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

DrLou asked Dec 28 '11 23:12


2 Answers

To read a binary file you must open the file in binary mode. Change

input = open(fullpath, "r")

to

input = open(fullpath, "rb")

The result of read() will then be a bytes object rather than a str.
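As a minimal sketch of that change applied to the whole walk: the helper name read_files_as_bytes is hypothetical (it does not appear in the question), and the with block is an addition so that each file is closed after reading.

```python
import os


def read_files_as_bytes(top):
    """Yield (fullpath, data) for every file under top, reading raw bytes."""
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            # "rb" skips text decoding entirely, so read() returns bytes
            # and cannot raise UnicodeDecodeError.
            with open(fullpath, "rb") as handle:
                yield fullpath, handle.read()


if __name__ == "__main__":
    # Path taken from the question; adjust to your own directory.
    for fullpath, data in read_files_as_bytes("/home/user/my-files"):
        print("fullpath: ---->", fullpath, "[length:", len(data), "]")
```

Here len(data) is a count of bytes, not characters, which is usually what you want for binary processing.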

Lennart Regebro answered Sep 23 '22 16:09


Because some of your files are binary, their contents cannot be decoded into the Unicode characters that Python 3 uses for all text strings. One of the largest changes between Python 2 and Python 3 is that str now holds Unicode characters rather than bytes, so a character can no longer be treated as a single byte. (Text can therefore take more memory than in Python 2: on Python 3.2 each character of a str occupies 2 or 4 bytes, depending on how the interpreter was built. That is a property of the in-memory representation, not of UTF-8, which is only the encoding used when the file is read.)

You therefore have a few options, depending on your project:

  • Ignore binary files, filtering them out by file extension.
  • Try to read every file as text and either catch the decoding exception when it occurs and skip that file, or detect binary files up front with one of the methods described in the thread "How can I detect if a file is binary (non-text) in python?"

In this vein, you may edit your solution to simply catch the UnicodeDecodeError and skip the file.
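A rough sketch of the catch-and-skip approach follows. The function name read_text_files and the returned (texts, skipped) pair are made up for illustration, and UTF-8 is assumed only because that is the codec named in the traceback.

```python
import os


def read_text_files(top, encoding="utf-8"):
    """Return (texts, skipped): decoded contents keyed by path, plus the
    paths whose bytes were not valid text in the given encoding."""
    texts = {}
    skipped = []
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            try:
                with open(fullpath, "r", encoding=encoding) as handle:
                    texts[fullpath] = handle.read()
            except UnicodeDecodeError:
                # Decoding failed: treat the file as binary and move on.
                skipped.append(fullpath)
    return texts, skipped
```

Keep in mind that a binary file can happen to decode without error, so this filters out only the files that are provably not valid text in the chosen encoding.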

Regardless of your decision, note that if the files on your system use a range of different character encodings, you will need to specify the encoding explicitly when you open each file. Otherwise Python 3 decodes with the platform's default encoding, which was UTF-8 in your case, as the traceback shows.
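One way to pass the encoding explicitly is sketched below. The helper read_with_fallback and its candidate list ("utf-8", then "cp1252") are assumptions chosen for illustration; substitute the encodings you actually expect to find.

```python
def read_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            # This candidate cannot decode the file; try the next one.
            continue
    raise ValueError("none of {!r} could decode {}".format(encodings, path))
```

A successful decode does not prove the guess was right, since many single-byte encodings accept almost any input, so put the strictest candidates first.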

As a reference, a great presentation on Python 3 I/O: http://www.dabeaz.com/python3io/MasteringIO.pdf

Cory Dolphin answered Sep 23 '22 16:09