It seems that Python's UTF-8 decoding (via the codecs module) interprets Unicode characters 28, 29, and 30 as line endings. Why? And how can I prevent it from doing so?
Example code:
with open('unicodetest.txt', 'w') as f:
    f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')

with open('unicodetest.txt', 'r') as f:
    for i,l in enumerate(f):
        print i, l
# prints "0 abcde" with the special characters in between.
The point here is that it reads the file as one line, as I expect it to. But when I use codecs to read it as UTF-8, it is interpreted as many lines.
import codecs

with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
    for i,l in enumerate(f):
        print i, l
# 0 a
# 1 b
# 2 c
# 3 de
# (again with the special characters after each of a, b, c, and d)
The characters 28 through 31 are described as "Information Separator Four" through "One" (in that order). Two things strike me: 1) 28 to 30 are interpreted as line ends, 2) 31 is not. Is this intended behaviour? Where can I find a definition of which characters are interpreted as line ends? Is there a way to not interpret them as line ends?
Thanks.
Edit: I forgot to copy the 'UTF-8' argument in codecs.open. The code in my question is now corrected.
This is a great question.
It makes a difference whether you open a file with open() or codecs.open(). The former operates in terms of byte strings; the latter operates in terms of Unicode strings. In Python, these behave differently.
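To make that distinction concrete, here is a small sketch (Python 2.7, reusing the test file written in the question) of the type each call hands back:

import codecs

with open('unicodetest.txt', 'r') as f:
    print type(f.read())    # <type 'str'> - a byte string
with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
    print type(f.read())    # <type 'unicode'> - a Unicode string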
This same question came up as Python Issue 7643, What is a Unicode line break character?. The discussion, and the citations to the Unicode Character Database, are fascinating. Issue 7643 also gives this concise code snippet to demonstrate the difference:
for s in '\x0a\x0d\x1c\x1d\x1e':
    print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
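On Python 2.7 that snippet prints output along these lines; the Unicode results on the left split on \x1c, \x1d, and \x1e, while the byte-string results on the right split only on \x0a and \x0d:

[u'a\n', u'b'] ['a\n', 'b']
[u'a\r', u'b'] ['a\r', 'b']
[u'a\x1c', u'b'] ['a\x1cb']
[u'a\x1d', u'b'] ['a\x1db']
[u'a\x1e', u'b'] ['a\x1eb']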
But it boils down to this.
To determine if bytes in byte strings are line breaks (or whitespace), Python uses the rules of ASCII control characters. By that measure, bytes 10 and 13 are line break characters (and Python treats byte 13 followed by 10 as a single line break).
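For example, with a byte string (Python 2.7):

# Only \n, \r, and \r\n count as line breaks in a byte string;
# byte 28 (\x1c) stays inside the line.
print 'a\nb\rc\r\nd\x1ce'.splitlines()
# ['a', 'b', 'c', 'd\x1ce']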
But to determine whether characters in Unicode strings are line breaks, Python follows the character classifications of the Unicode Character Database, documented in UAX #44, and of the UAX #14 Line Breaking Algorithm, section 5, "Line Breaking Properties". According to Issue 7643, those documents define the character properties which mark a character as a line break for Python's purposes.
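You can inspect the relevant classifications yourself with the unicodedata module (a sketch on Python 2.7; the values come straight from the Unicode Character Database):

import unicodedata

# General category and bidirectional class of the separator characters.
for cp in (0x1C, 0x1D, 0x1E, 0x1F):
    c = unichr(cp)
    print hex(cp), unicodedata.category(c), unicodedata.bidirectional(c)
# 0x1c Cc B
# 0x1d Cc B
# 0x1e Cc B
# 0x1f Cc S

The three characters that Python splits on carry bidirectional class B ("Paragraph Separator"), while character 31 carries class S, which lines up with the behaviour you are seeing.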
Characters 28 (0x001C), 29 (0x001D), and 30 (0x001E) have those character properties. Character 31 (0x001F) does not. Why? That's a question for the Unicode Technical Committee. But in ASCII, these characters were known as "File Separator", "Group Separator", "Record Separator", and "Unit Separator". Using a tabbed text data file as a comparison, the first three connote at least as much separation as a line break does, while the fourth is perhaps analogous to the tab.
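A quick check of that claim (Python 2.7):

# Characters 28, 29, and 30 split a Unicode string; 31 does not.
print u'a\x1cb\x1dc\x1ed\x1fe'.splitlines()
# [u'a', u'b', u'c', u'd\x1fe']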
You can see the code which actually defines these three Unicode characters as line breaks for Python Unicode strings in Objects/unicodeobject.c. Look for the array ascii_linebreak[]. This array underlies the implementation of unicode.splitlines(); different code underlies str.splitlines(). I believe, but haven't traced it in the Python source code, that enumerate() on a file opened with codecs.open() is implemented in terms of unicode.splitlines().
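One way to probe that belief without reading the C source is to look at StreamReader.readline(), which is pure Python (a quick check on CPython 2.7; other builds may differ):

import codecs, inspect

# codecs.StreamReader.readline (Lib/codecs.py) splits the decoded
# Unicode data with splitlines(), so the Unicode line-break rules apply.
print 'splitlines' in inspect.getsource(codecs.StreamReader.readline)
# True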
You ask, "how can I prevent it from doing so?" I don't see any way to make splitlines()
behave differently. However, you can open the file as a byte stream, read lines as bytes with the str.splitlines()
behaviour, then decode each line as UTF-8 for use as a unicode string:
with open('unicodetest.txt', 'r') as f:
    for i,l in enumerate(f):
        print i, l.decode('UTF-8')
# prints "0 abcde" with the special characters in between.
I assume you are using Python 2.x, not 3.x. My answer is based on Python 2.7.