I read this in the Python tutorial (http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files):
Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files.
I don't quite understand how 'end-of-line characters in text files are altered' will 'corrupt binary data'. Because I feel binary data don't have such things like end-of-line.
Can somebody explain more of this paragraph for me? It's making me feel like Python doesn't welcome binary files.
You just have to take care to open files on Windows in binary mode (open(filename, "rb")) and not as text files. After that there is no problem using the data.
In particular, the end-of-line sequence on Windows is '\r\n'. If you read a binary file as a text file and write it back out, every single '\n' is transformed into a '\r\n' sequence. If you open the files as binary (for reading and for writing), there is no such problem.
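A minimal sketch of the difference, written for Python 3 so it behaves the same on any OS. The newline="\r\n" argument is my own stand-in for what Windows text mode does on writing, and payload and write_then_read_raw are made-up names for illustration:

```python
import os
import tempfile

# Four bytes of made-up "binary" data; the second byte happens to be 10 ("\n").
payload = b"\x01\n\x02\x03"

def write_then_read_raw(data, text_mode):
    """Write data to a temp file, then return the raw bytes that ended up on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        if text_mode:
            # Assumption: newline="\r\n" imitates Windows-style translation on any OS.
            with open(path, "w", encoding="latin-1", newline="\r\n") as f:
                f.write(data.decode("latin-1"))
        else:
            # Binary mode: bytes are written exactly as given.
            with open(path, "wb") as f:
                f.write(data)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

text_result = write_then_read_raw(payload, text_mode=True)
binary_result = write_then_read_raw(payload, text_mode=False)
print(text_result)    # b'\x01\r\n\x02\x03'  (one extra byte sneaked in)
print(binary_result)  # b'\x01\n\x02\x03'    (unchanged)
```

The text-mode file is one byte longer than the data you asked to write, which is exactly the kind of change that breaks a JPEG or EXE.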
Python is perfectly capable of dealing with binary data, and you have to take this kind of care in any language on Windows, not just in Python (the developers of Python are simply friendly enough to warn you about possible OS problems). On systems like Linux, where the end-of-line is a single character, the distinction exists as well, but it is less likely to cause a problem when reading or writing binary data as text (i.e. without the b option when opening the file).
I feel binary data don't have such things like end-of-line.
Binary files can have ANY POSSIBLE byte value in them, including the one for the character \n. You do not want Python implicitly converting any bytes in a binary file to something else. Python has no idea it is dealing with a binary file unless you tell it so. In text mode on Windows, Python converts every \r\n it reads into \n, and every \n it writes into the OS's newline sequence, which on Windows is \r\n.
That is not specific to Python: text mode in C's standard I/O library on Windows does the same translation.
Another way to think about it is: a file is just a long series of bytes (8 bits each). A byte is just an integer from 0 to 255, and it can be any of those integers. If a byte happens to be the integer 10, that is also the ASCII code for the character \n. If the bytes in the file represent binary data, you don't want Python to write out a 10 as two bytes, 13 and 10, or to read in a 13 followed by a 10 and hand you only the 10. Usually when you read binary data, you want to read, say, the first 2 bytes, which represent a number, then the next 4 bytes, which represent another number, and so on. Obviously, if Python suddenly turns one byte into two (or two into one), that causes two problems: 1) it alters the data, and 2) all your data boundaries get shifted.
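To make the "2 bytes, then 4 bytes" idea concrete, here is a small sketch using the struct module. The record layout (a little-endian 2-byte integer followed by a 4-byte integer) and the values are made up for illustration, and the replace() call only simulates what Windows text-mode writing would do to the bytes:

```python
import struct

# Hypothetical record: 2-byte unsigned int, then 4-byte unsigned int, little-endian.
# The first value is 10, so the packed data contains the byte 10 ("\n").
record = struct.pack("<HI", 10, 70000)

# Simulate Windows text-mode writing: every byte 10 becomes the pair 13, 10.
mangled = record.replace(b"\n", b"\r\n")

first, second = struct.unpack("<HI", record)
# The mangled data is one byte longer, so every field boundary has shifted.
bad_first, bad_second = struct.unpack("<HI", mangled[:6])

print(len(record), len(mangled))  # 6 7
print(first, second)              # 10 70000
print(bad_first, bad_second)      # both values are now wrong
```

Note that the second number is wrong too, even though none of its own bytes were touched: it is simply being read from the wrong offset.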
An example: suppose the first byte of a file is supposed to represent a dog's weight, and the byte's value is 10. The next byte is supposed to represent the dog's age, and its value is 1. If the file is written in text mode on Windows, Python converts the 10, which is the ASCII code for \n, to two bytes, 13 and 10, so the data that ends up in the file looks like:
13 10 1
When a program later reads that file byte for byte, the first byte (the dog's weight) is 13, not 10, and the second byte (the dog's age) is 10, not 1.
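The dog example can be replayed in a few lines of Python 3. This is only a sketch: the replace() call stands in for the newline translation that Windows text mode performs on writing, and the variable names are mine:

```python
weight, age = 10, 1

# The two bytes we intend to store: weight first, then age.
original = bytes([weight, age])

# Simulated text-mode write on Windows: byte 10 ("\n") becomes 13, 10 ("\r\n").
on_disk = original.replace(b"\n", b"\r\n")

print(list(original))  # [10, 1]
print(list(on_disk))   # [13, 10, 1]

# Reading the fields back by position now gives the wrong values.
read_weight, read_age = on_disk[0], on_disk[1]
print(read_weight, read_age)  # 13 10
```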
We often say a file contains 'characters', but strictly speaking that is false. Computers cannot store characters; they can only store numbers. So a file is just a long series of numbers. If you tell Python to treat those numbers as ASCII codes, which represent characters, then Python will give you text.
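A tiny Python 3 illustration of that last point, with three made-up byte values: the same numbers can be looked at as plain integers or, if you ask for it, decoded as ASCII text.

```python
# Three numbers, as they might sit in a file.
raw = bytes([72, 105, 10])

# Viewed as numbers, they are just integers.
print(list(raw))  # [72, 105, 10]

# Viewed as ASCII codes, the same numbers become the text "Hi" plus a newline.
text = raw.decode("ascii")
print(repr(text))  # 'Hi\n'
```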