Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Don't convert newline when reading a file

Tags:

python

I'm reading a text file:

f = open('data.txt')
data = f.read()

However newline in data variable is normalized to LF ('\n') while the file contains CRLF ('\r\n').

How can I instruct Python to read the file as is?

like image 214
theta Avatar asked Nov 29 '22 02:11

theta


2 Answers

In Python 2.x:

f = open('data.txt', 'rb')

As the docs say:

The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)

In Python 3.x, there are three alternatives:

f1 = open('data.txt', 'rb')

This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)

f2 = open('data.txt', 'r', newline='')

This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.

f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))

This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.

The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)

However:

'encoding' … should only be used in text mode.

And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.

So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1are the same. But if you want to deal instr, as appropriate for each version, the simplest answer is to write different code for each, probablyfandf2`, respectively. If this comes up a lot, consider writing either wrapper function:

if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        return open(path, mode+'b')

Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why—or read the docs.)

Of course if you actually know the file's encoding, that's always better than guessing anyway.

In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.

like image 115
abarnert Avatar answered Dec 10 '22 10:12

abarnert


You need to open the file in the binary mode:

f = open('data.txt', 'rb')
data = f.read()

('r' for "read", 'b' for "binary")

Then everything is returned as is, nothing is normalized

like image 33
bereal Avatar answered Dec 10 '22 11:12

bereal