I'm reading a text file:
f = open('data.txt')
data = f.read()
However, the newlines in the data variable are normalized to LF ('\n'), while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
In Python 2.x:
f = open('data.txt', 'rb')
As the docs say:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course, the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x, str is Unicode.)
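As a minimal sketch of the f1 approach on Python 3: the sample file contents and the UTF-8 encoding below are assumptions made for illustration, not something the question specifies.

```python
import os
import tempfile

# Create a small sample file with CRLF line endings (hypothetical content).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(b'first\r\nsecond\r\n')

# Binary mode: no newline translation, and read() returns bytes.
with open(path, 'rb') as f1:
    raw = f1.read()

# Decode explicitly; UTF-8 is an assumption about the file's encoding.
text = raw.decode('utf-8')

print(repr(raw))   # b'first\r\nsecond\r\n'
print(repr(text))  # 'first\r\nsecond\r\n'

os.remove(path)
```

The CRLF pairs survive both the read and the decode, because decoding never touches line endings.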
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
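The line-splitting difference is easiest to see with a file that mixes line endings. The sketch below (Python 3, with made-up sample bytes and an assumed ASCII encoding) compares readlines() in text mode with newline='' against binary mode, which only splits on b'\n'.

```python
import os
import tempfile

# Sample data mixing a lone CR, a CRLF, and a lone LF (hypothetical content).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(b'a\rb\r\nc\n')

# Text mode with newline='': every kind of line ending terminates a line,
# but the endings themselves are returned untranslated.
with open(path, 'r', newline='', encoding='ascii') as f2:
    text_lines = f2.readlines()

# Binary mode: only b'\n' terminates a line, so the lone CR stays inside a line.
with open(path, 'rb') as f1:
    byte_lines = f1.readlines()

print(text_lines)  # ['a\r', 'b\r\n', 'c\n']
print(byte_lines)  # [b'a\rb\r\n', b'c\n']

os.remove(path)
```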
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.
On the newline argument used for f2, the docs say:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
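A short sketch of the two behaviours described in that passage, on Python 3. The sample bytes and the ASCII encoding are assumptions chosen so the result does not depend on the platform's defaults.

```python
import os
import tempfile

# Sample file containing all three line-ending styles (hypothetical content).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(b'one\r\ntwo\rthree\n')

# newline=None (the default): every line ending is translated to '\n'.
with open(path, 'r', encoding='ascii') as f:
    translated = f.read()

# newline='': line endings are returned exactly as stored in the file.
with open(path, 'r', encoding='ascii', newline='') as f:
    untranslated = f.read()

print(repr(translated))    # 'one\ntwo\nthree\n'
print(repr(untranslated))  # 'one\r\ntwo\rthree\n'

os.remove(path)
```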
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' … should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
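To see the enforcement for yourself, here is a small Python 3 sketch that tries the f3-style call on a throwaway temporary file and captures the error instead of crashing. The variable names are just for illustration.

```python
import os
import tempfile

# An empty temporary file is enough; the call is rejected regardless of content.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    f3 = open(path, 'rb', encoding='utf-8')
except ValueError as exc:
    message = str(exc)
else:
    # Not expected on current 3.x, but close the file if the call succeeds.
    f3.close()
    message = None

print(message)  # expected to be the "binary mode doesn't take an encoding argument" message

os.remove(path)
```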
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
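import sys

if sys.version_info >= (3, 0):
    # 3.x: text mode with newline='' returns str and leaves line endings untranslated.
    def crlf_open(path, mode):
        return open(path, mode, newline='')
else:
    # 2.x: binary mode returns str (bytes) and leaves line endings untouched.
    def crlf_open(path, mode):
        return open(path, mode + 'b')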
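A usage sketch for the wrapper, repeated here so the snippet is self-contained. The temporary file and its contents are made up for the example. Note that on 3.x the mode passed in should be a text mode such as 'r', since newline='' is not accepted together with a binary mode.

```python
import os
import sys
import tempfile

if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        # 3.x: text mode, line endings left untranslated.
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        # 2.x: binary mode, which already leaves line endings alone.
        return open(path, mode + 'b')

# Hypothetical sample file with CRLF line endings.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(b'x\r\ny\r\n')

with crlf_open(path, 'r') as f:
    data = f.read()

print(repr(data))  # 'x\r\ny\r\n'

os.remove(path)
```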
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why, or read the docs.)
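A quick way to run that comparison is sketched below. The results depend on the interpreter version, platform, and locale settings, so no particular output is shown.

```python
import locale

# False: report the encoding without calling setlocale() first.
enc_without_setlocale = locale.getpreferredencoding(False)

# True: let the function temporarily apply the user's locale settings first.
enc_with_setlocale = locale.getpreferredencoding(True)

print(enc_without_setlocale)
print(enc_with_setlocale)
```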
Of course, if you actually know the file's encoding, that's always better than guessing anyway.
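For instance, if the file is known to be UTF-8 (an assumption made here purely for illustration), the encoding can be passed alongside newline='' on Python 3:

```python
import os
import tempfile

# Hypothetical UTF-8 file with a non-ASCII character and a CRLF line ending.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write('caf\u00e9\r\n'.encode('utf-8'))

# Explicit encoding: no dependence on the locale. newline='' keeps the CRLF.
with open(path, 'r', encoding='utf-8', newline='') as f:
    known = f.read()

print(ascii(known))  # 'caf\xe9\r\n'

os.remove(path)
```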
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
You need to open the file in binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary".)
Then everything is returned as is; nothing is normalized.