Python thinks a 3000-line text file is one line long?

Question

I have a very long text file that I'm trying to process using Python.

However, the following code:

for line in open('textbase.txt', 'r'):
    print 'hello world'

produces only the following output:

hello world

It's as though Python thinks the file is only one line long, even though it is many thousands of lines long when viewed in a text editor. Examining it on the command line with the file command gives:

$ file textbase.txt
textbase.txt: Big-endian UTF-16 Unicode English text, with CR line terminators

Is something wrong? Do I need to change the line terminators?

Josh Lee · Accepted Answer

According to the documentation for open(), you should add a U to the mode:

open('textbase.txt', 'Ur')

This enables "universal newlines", which normalizes every line ending to \n in the strings it gives you.

However, the correct thing to do is to decode the UTF-16BE into Unicode objects first, before translating the newlines. Otherwise, a chance 0x0d byte could get erroneously turned into a 0x0a, resulting in

UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 12: truncated data.
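
As a rough illustration of that failure mode, here is a minimal sketch. It assumes Python 3 and uses made-up sample strings rather than the asker's file. Touching newline bytes before decoding can either silently change a character or leave an odd number of bytes behind:

```python
# Hypothetical sample: the letter U+010D followed by a CR line terminator.
text = "\u010d\r"
raw = text.encode("utf-16-be")  # bytes: 01 0d 00 0d

# Wrong order: translate CR bytes to LF bytes, then decode.
# The low byte of U+010D is 0x0d, so the letter itself gets rewritten.
mangled = raw.replace(b"\r", b"\n").decode("utf-16-be")

# Right order: decode to text first, then translate the newline characters.
correct = raw.decode("utf-16-be").replace("\r", "\n")

print(repr(mangled))  # the letter has become U+010A, a different character
print(repr(correct))  # the letter is intact

# Splitting on newline bytes tears 2-byte code units apart.
chunks = "a\nb".encode("utf-16-be").split(b"\n")
try:
    chunks[0].decode("utf-16-be")
    error = None
except UnicodeDecodeError as exc:
    error = exc.reason

print(error)
```

The first chunk ends up three bytes long, which is why the decoder complains instead of returning text.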

Python's codecs module supplies an open function that can decode Unicode and handle newlines at the same time:

import codecs
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    ...
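
For readers on Python 3, the same decode-then-translate idea can be sketched with the built-in open(), which accepts an encoding and applies universal newlines by default in text mode (recent Python 3 releases no longer accept the 'U' flag). The file name textbase_sample.txt and its contents are invented here so that the example is self-contained:

```python
import os
import tempfile

# Build a small stand-in for textbase.txt: UTF-16BE text with CR-only line endings.
sample = "first\rsecond\rthird\r"
path = os.path.join(tempfile.mkdtemp(), "textbase_sample.txt")
with open(path, "wb") as f:
    f.write(sample.encode("utf-16-be"))

# Text mode decodes the bytes first and then applies universal newlines
# (newline=None is the default), so each CR-terminated line comes back
# as its own string ending in "\n".
with open(path, "r", encoding="utf-16-be") as f:
    lines = list(f)

print(len(lines))
print(lines)
```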

If the file has a byte order mark (BOM) and you specify 'utf-16', then it detects the endianness and hides the BOM for you. If it does not (since the BOM is optional), then that decoder will just go ahead and use your system's endianness, which probably won't be good.
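
The difference between the two codec names can be seen on a few bytes. This is a small Python 3 sketch with an invented payload, not output from the asker's file:

```python
import codecs

payload = "hello"
with_bom = codecs.BOM_UTF16_BE + payload.encode("utf-16-be")

# Generic 'utf-16' reads the BOM, picks the byte order from it, and drops it.
auto = with_bom.decode("utf-16")

# Explicit 'utf-16-be' treats the BOM as an ordinary U+FEFF character.
explicit = with_bom.decode("utf-16-be")

print(repr(auto))
print(repr(explicit))
```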

Specifying the endianness yourself (with 'utf-16be') will not hide the BOM, so you might wish to use this hack:

import codecs
firstline = True
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    if firstline:
        firstline = False
        # An explicit 'utf-16be' leaves the BOM in place; strip it from the first line.
        line = line.lstrip(u'\ufeff')
    ...
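
One way to keep that bookkeeping out of the processing loop is a small generator. The sketch below assumes Python 3, and iter_lines_without_bom is a hypothetical helper name rather than anything from the standard library. It drops a single leading U+FEFF, whereas lstrip would remove repeated ones:

```python
import codecs
import io

def iter_lines_without_bom(stream):
    """Yield lines from a text stream, dropping one leading U+FEFF if present."""
    for index, line in enumerate(stream):
        if index == 0 and line.startswith("\ufeff"):
            line = line[1:]
        yield line

# In-memory stand-in for a UTF-16BE file that starts with a BOM and uses CR endings.
raw = codecs.BOM_UTF16_BE + "one\rtwo\r".encode("utf-16-be")
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16-be")

lines = list(iter_lines_without_bom(stream))
print(lines)
```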

See also: Python Unicode HOWTO

paxdiablo · Answer

You'll probably find it's the "with CR line terminators" that gives the game away. If you're working on a platform that uses LF (\n) as its line terminator, it will see your file as one big honkin' line.

Change your input file so that it uses the correct line terminators. Your editor is probably more forgiving than your Python implementation.
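
If you go the route of changing the file, a minimal Python 3 sketch might look like this. convert_cr_to_lf and the temporary file names are made up for illustration, and the encoding is assumed to be UTF-16BE as reported by the file command in the question:

```python
import os
import tempfile

def convert_cr_to_lf(src, dst, encoding="utf-16-be"):
    """Rewrite src as dst with LF line endings, keeping the same encoding."""
    # newline=None (the default) turns \r, \r\n and \n into \n while reading.
    with open(src, "r", encoding=encoding) as fin:
        # newline="\n" writes \n unchanged instead of the platform default.
        with open(dst, "w", encoding=encoding, newline="\n") as fout:
            for line in fin:
                fout.write(line)

# Demo on a throwaway file with CR-only terminators.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "cr.txt")
dst = os.path.join(workdir, "lf.txt")
with open(src, "wb") as f:
    f.write("one\rtwo\r".encode("utf-16-be"))

convert_cr_to_lf(src, dst)

with open(dst, "rb") as f:
    converted = f.read().decode("utf-16-be")
print(repr(converted))
```

Decoding before translating matters here for the same reason as when reading: the conversion works on characters, not on raw bytes.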

CR line endings are a classic Mac OS convention as far as I'm aware. Alternatively, you can pass the U mode modifier to open() so that \r, \n and \r\n are all recognized as line terminators.
