I have a text file with first line of unicode characters and all other lines in ASCII. I try to read the first line as one variable, and all other lines as another. However, when I use the following code:
# -*- coding: utf-8 -*-
import codecs
import os
filename = '1.txt'
f = codecs.open(filename, 'r3', encoding='utf-8')
print f
names_f = f.readline().split(' ')
data_f = f.readlines()
print len(names_f)
print len(data_f)
f.close()
print 'And now for something completely differerent:'
g = open(filename, 'r')
names_g = g.readline().split(' ')
print g
data_g = g.readlines()
print len(names_g)
print len(data_g)
g.close()
I get the following output:
<open file '1.txt', mode 'rb' at 0x01235230>
28
7
And now for something completely differerent:
<open file '1.txt', mode 'r' at 0x017875A0>
28
77
If I don't use readlines(), whole file reads, not only first 7 lines both at codecs.open() and open().
Why does such thing happen? And why does codecs.open() read file in binary mode, despite the 'r' parameter is added?
Upd: This is original file: http://www1.datafilehost.com/d/0792d687
Because you used .readline() first, the codecs.open() file has filled a linebuffer; the subsequent call to .readlines() returns only the buffered lines.
If you call .readlines() again, the rest of the lines are returned:
>>> f = codecs.open(filename, 'r3', encoding='utf-8')
>>> line = f.readline()
>>> len(f.readlines())
7
>>> len(f.readlines())
71
The work-around is to not mix .readline() and .readlines():
f = codecs.open(filename, 'r3', encoding='utf-8')
data_f = f.readlines()
names_f = data_f.pop(0).split(' ')  # take the first line.
This behaviour is really a bug; the Python devs are aware of it, see issue 8260.
The other option is to use io.open() instead of codecs.open(); the io library is what Python 3 uses to implement the built-in open() function and is a lot more robust and versatile than the codecs module.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With