Line reading chokes on 0x1A

I have the following file:

abcde
kwakwa
<0x1A>
line3
linllll

Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:

for line in open('t.txt'):
    print line,

It only reads the first two lines, and exits the loop.

The solution seems to be to open the file in binary mode ('rb') or universal newline mode ('rU'). Can you explain this behavior?

Eli Bendersky asked Jan 01 '09 15:01



2 Answers

0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try using a command prompt and "type"ing your file. It will only display the content up to the Ctrl-Z.

Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics for files opened in text mode.
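To illustrate the binary-mode workaround mentioned in the question, here is a minimal sketch. It writes the sample file with a literal 0x1A byte and reads it back with 'rb', so the C runtime's text-mode handling is not involved. The helper names write_sample and read_lines_binary are made up for this example, and the file contents are assumed to match the question's t.txt.

```python
import os
import tempfile

def write_sample(path):
    # Recreate the file from the question; b'\x1a' is the Ctrl-Z byte.
    with open(path, 'wb') as f:
        f.write(b'abcde\nkwakwa\n\x1a\nline3\nlinllll\n')

def read_lines_binary(path):
    # In binary mode the bytes are returned as stored, so 0x1A is
    # treated as ordinary data rather than as an end-of-file marker.
    with open(path, 'rb') as f:
        return [line.rstrip(b'\r\n') for line in f]

if __name__ == '__main__':
    # Hypothetical location for the sample file.
    sample = os.path.join(tempfile.mkdtemp(), 't.txt')
    write_sample(sample)
    for line in read_lines_binary(sample):
        print(repr(line))
```

The trade-off is that binary mode gives you bytes with no newline translation, so any '\r\n' handling is up to you.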

Ned Batchelder answered Oct 23 '22 03:10



Ned is of course correct.

If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end-of-file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level; it only kept track of the number of floppy disk sectors. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.
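The sector-granularity problem can be sketched in a few lines. This is only an illustration of the idea, not CP/M's actual code: the function names are hypothetical, and filling the whole tail of the last 128-byte record with 0x1A is an assumption made for the example.

```python
RECORD_SIZE = 128      # record size mentioned above
EOF_MARK = b'\x1a'     # Ctrl-Z

def pad_to_records(text):
    # Storage only knows whole records, so fill the unused tail of the
    # last record with the marker byte (assumed padding scheme).
    remainder = len(text) % RECORD_SIZE
    if remainder == 0:
        return text
    return text + EOF_MARK * (RECORD_SIZE - remainder)

def read_text(stored):
    # The logical end of the text is the first marker byte, if any.
    end = stored.find(EOF_MARK)
    return stored if end == -1 else stored[:end]

if __name__ == '__main__':
    stored = pad_to_records(b'hello\r\n')
    print(len(stored))
    print(repr(read_text(stored)))
```

A reader that follows this convention stops at the first 0x1A it sees, which is the same behavior the question runs into when the byte appears in the middle of a file.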

Mark Ransom answered Oct 23 '22 04:10
